Tracking stops a model from hiding behind good days.

Anyone can point to a few successful selections after the event. Proper tracking records what the model supported before kickoff, then checks what happened after the match finished.

Simple version: proof tracking turns "trust me" into "here is what happened over time."

Backtesting replays old football days.

Backtesting asks what the model would have selected in the past using only information that would have been available before each match. That lets the logic be tested for value before it is shown to customers. If you are new to the market itself, start with what Over 1.5 goals means.

Good backtesting should avoid cherry-picking. It should include losing runs, different leagues, different prices, and enough fixtures to show whether a pattern is real or just noise.
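The idea of replaying old fixtures with only pre-match information can be sketched in a few lines of Python. This is an illustration, not the model described here: the Fixture fields, the three-hour MATCH_WINDOW, the naive_predict toy model, and the 0.6 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: a result counts as "known" three hours after kickoff.
MATCH_WINDOW = timedelta(hours=3)

@dataclass(frozen=True)
class Fixture:
    kickoff: datetime
    league: str
    total_goals: int  # the outcome; never read for the fixture being predicted

def naive_predict(past):
    """Toy model: share of already-finished fixtures with 2+ goals."""
    if not past:
        return 0.0
    return sum(f.total_goals >= 2 for f in past) / len(past)

def backtest(fixtures, predict=naive_predict, threshold=0.6):
    """Replay fixtures in kickoff order, predicting each from earlier results only."""
    ordered = sorted(fixtures, key=lambda f: f.kickoff)
    log = []
    for fx in ordered:
        # Only fixtures whose result was known before this kickoff are visible.
        past = [p for p in ordered if p.kickoff + MATCH_WINDOW <= fx.kickoff]
        prob = predict(past)
        if prob >= threshold:
            # Record every selection, including the ones that lost.
            log.append((fx.kickoff, prob, fx.total_goals >= 2))
    return log

# Made-up fixtures, for illustration only.
day = datetime(2024, 1, 1, 15, 0)
sample = [
    Fixture(day, "A", 3),
    Fixture(day + timedelta(days=1), "A", 2),
    Fixture(day + timedelta(days=2), "B", 0),
    Fixture(day + timedelta(days=3), "B", 1),
]
backtest_log = backtest(sample)
```

The log keeps losing selections alongside winning ones, which is the point: a backtest that only stores the hits is cherry-picking by construction.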

Locked selections protect the proof.

A locked selection is recorded before kickoff. That matters because it prevents anyone from changing the shortlist after the result is known. If a model is going to build trust, the record needs to be created before the outcome is known.

GoalsProof uses the idea of official picks and proof logs so users can separate genuine pre-match selections from hindsight.
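One way to make a pre-kickoff record hard to alter afterwards is to reject late entries and store a hash of each record. The sketch below is hypothetical and is not GoalsProof's implementation: lock_selection, verify, and every field name are assumptions chosen for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

def lock_selection(fixture_id, market, kickoff, locked_at, prev_hash=""):
    """Create a proof-log record, refusing anything not locked before kickoff."""
    if locked_at >= kickoff:
        raise ValueError("selection must be locked before kickoff")
    record = {
        "fixture_id": fixture_id,
        "market": market,
        "kickoff": kickoff.isoformat(),
        "locked_at": locked_at.isoformat(),
        "prev_hash": prev_hash,  # links each record to the one before it
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["hash"] = hashlib.sha256(payload).hexdigest()
    return record

def verify(record):
    """Recompute the hash; any edit to the stored fields makes this False."""
    body = {k: v for k, v in record.items() if k != "hash"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == record["hash"]

# Hypothetical fixture id and times, for illustration only.
kickoff = datetime(2024, 1, 6, 15, 0, tzinfo=timezone.utc)
locked = datetime(2024, 1, 6, 9, 30, tzinfo=timezone.utc)
record = lock_selection("fixture-001", "Over 1.5 goals", kickoff, locked)
```

Chaining each record to the previous hash means a quietly removed or rewritten entry breaks every record after it, which is what separates a proof log from an editable spreadsheet.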

Calibration checks whether confidence is honest.

If a model says a group of selections has 80% confidence, those selections should land close to 80% of the time over a large enough sample. If they land much less often, the model is overconfident. If they land much more often, the model is underconfident.

Calibration is especially useful for confidence buckets such as Elite, Strong, Watch, and Caution because it shows whether each bucket deserves its label.
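A calibration check per bucket can be as simple as comparing the average stated confidence with the observed hit rate. The following is a minimal sketch: the bucket names come from the text above, but the calibration_by_bucket function, the tuple layout, and the sample numbers are assumptions made up for the example.

```python
from collections import defaultdict

def calibration_by_bucket(selections):
    """selections: iterable of (bucket, predicted_prob, landed) tuples."""
    groups = defaultdict(list)
    for bucket, prob, landed in selections:
        groups[bucket].append((prob, landed))

    report = {}
    for bucket, rows in groups.items():
        n = len(rows)
        mean_predicted = sum(prob for prob, _ in rows) / n
        hit_rate = sum(1 for _, landed in rows if landed) / n
        report[bucket] = {
            "n": n,
            "mean_predicted": mean_predicted,
            "hit_rate": hit_rate,
            # Negative gap: overconfident. Positive gap: underconfident.
            "gap": hit_rate - mean_predicted,
        }
    return report

# Made-up selections, for illustration only.
selections = [
    ("Elite", 0.80, True), ("Elite", 0.80, True),
    ("Elite", 0.80, True), ("Elite", 0.80, False),
    ("Watch", 0.60, True), ("Watch", 0.60, False),
]
calibration_report = calibration_by_bucket(selections)
```

With only a handful of rows per bucket the gap is mostly noise, so the "n" column matters as much as the gap itself.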

Small samples can fool everyone.

Ten good picks can be luck. Ten bad picks can also be normal variance. A model should be judged across weeks and months, with results split by league, confidence bucket, price range, and risk flag. Measures worth tracking across those splits include:

  • Hit rate: how often the selection landed.
  • Expected value: whether the price was high enough, given the estimated chance, to be worth taking.
  • Closing-line movement: whether the market moved toward or away from the model's view.
  • Risk flags: whether flagged selections perform worse than clean selections.
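The four measures above can be computed from a simple list of settled selections. The sketch below assumes decimal odds and a hypothetical row layout (prob, odds_taken, odds_close, landed, risk_flag); none of these names come from the original, and the sample rows are made up.

```python
def expected_value(prob, decimal_odds):
    """Expected profit per unit staked at decimal odds: prob * odds - 1."""
    return prob * decimal_odds - 1

def closing_line_value(odds_taken, odds_close):
    """Positive when the closing price was shorter than the price taken,
    meaning the market moved toward the selection."""
    return odds_taken / odds_close - 1

def hit_rate(rows):
    return sum(1 for r in rows if r["landed"]) / len(rows)

def summarise(rows):
    flagged = [r for r in rows if r["risk_flag"]]
    clean = [r for r in rows if not r["risk_flag"]]
    n = len(rows)
    return {
        "hit_rate": hit_rate(rows),
        "mean_ev": sum(expected_value(r["prob"], r["odds_taken"]) for r in rows) / n,
        "mean_clv": sum(closing_line_value(r["odds_taken"], r["odds_close"]) for r in rows) / n,
        "hit_rate_flagged": hit_rate(flagged) if flagged else None,
        "hit_rate_clean": hit_rate(clean) if clean else None,
    }

# Made-up settled selections, for illustration only.
rows = [
    {"prob": 0.80, "odds_taken": 1.30, "odds_close": 1.25, "landed": True, "risk_flag": False},
    {"prob": 0.78, "odds_taken": 1.33, "odds_close": 1.33, "landed": True, "risk_flag": False},
    {"prob": 0.76, "odds_taken": 1.36, "odds_close": 1.40, "landed": False, "risk_flag": True},
    {"prob": 0.82, "odds_taken": 1.28, "odds_close": 1.22, "landed": True, "risk_flag": True},
]
summary = summarise(rows)
```

Expected value here depends entirely on the model's own probability, so it is only as trustworthy as the calibration behind it. Closing-line movement is a useful cross-check because it does not rely on the model's estimate at all.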

Model tracking FAQs.

Why is backtesting useful?

It replays historical fixtures to see how the model would have behaved using information available before each match.

What is a locked selection?

It is a selection recorded before kickoff so the result can be reviewed later without hindsight.

Why does sample size matter?

Small samples can mislead. Performance should be judged over enough selections that random good or bad runs no longer dominate the result.
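To see how much a small sample can hide, it helps to put a range around the hit rate instead of quoting a single number. The sketch below uses the Wilson score interval, a standard choice for proportions; it is offered as one reasonable option, not as the method used here, and the function name and sample counts are assumptions.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a hit rate of hits / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# The same 80% hit rate at two very different sample sizes.
small_low, small_high = wilson_interval(8, 10)
large_low, large_high = wilson_interval(800, 1000)
```

Eight winners from ten gives a range of roughly 0.49 to 0.94, which is consistent with anything from a coin flip to an excellent model. At 800 from 1,000 the range tightens to a few points either side of 80%.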