Horse Racing Prediction Model
Probabilistic prediction and value betting strategy evaluation using logistic regression on 777K+ Betfair runners
Test AUC: 0.789
Model / BSP Log Loss: 0.282 / 0.279
ROI @ 2% Margin: +0.74% (95% CI crosses 0)
The Problem
The core task was to estimate each runner's pre-race win probability from public race information and compare those forecasts directly with Betfair Starting Price (BSP). That is a difficult benchmark: BSP is produced in a highly efficient exchange market, and the dataset has a base win rate of only ~10.5% across 777,549 runner-level observations. The modeling problem is therefore both heavily imbalanced and high-dimensional, spanning horse ability, jockey and trainer effects, race conditions, and pre-race market signals.
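To make the model-versus-market comparison fair, both sets of forecasts are expressed as probabilities that sum to one within each race. A minimal sketch of how BSP decimal odds convert to normalized implied probabilities, using a hypothetical three-runner race (the column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical race with three runners and their Betfair Starting Prices
# (decimal odds).
df = pd.DataFrame({
    "race_id": [1, 1, 1],
    "bsp": [2.5, 4.0, 6.0],
})

# Raw implied probability is 1 / decimal odds; summed across a race it
# exceeds 1 because of the market overround.
df["implied_raw"] = 1.0 / df["bsp"]

# Normalize within each race so the probabilities sum to exactly 1.
df["implied_prob"] = (
    df["implied_raw"] / df.groupby("race_id")["implied_raw"].transform("sum")
)
```

The same within-race normalization is applied to the model's calibrated probabilities before computing log loss against the market.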
Approach
The dataset contains 30 columns of race metadata, runner attributes, and market information for 777,549 runners. Feature engineering combined log-transformed pre-race odds (`log_ltp_5min`), a repaired `official_rating` for horse ability, chronological pre-race `jockey_rating` and `trainer_rating` features, and one-hot encoding of high-cardinality runner IDs and race metadata, producing more than 64,000 sparse features after preprocessing.

The data was split chronologically into 70/15/15 train, validation, and test periods. The main model was an L2-regularized logistic regression trained in a pipeline with `RandomUnderSampler` to address class imbalance, with `GridSearchCV` selecting `C=0.01` as the best regularization strength. A Random Forest was trained as a benchmark, isotonic calibration was fit on the held-out validation set, and both the calibrated model probabilities and BSP-implied probabilities were normalized within each race before market comparison and value-betting evaluation.
Results
The logistic regression beat the Random Forest on the test set, with AUC 0.789 versus 0.742, and the dominant feature in both models was log-transformed pre-race market odds. Isotonic calibration materially improved probability quality, reducing the Brier score from 0.195 to 0.082 and Expected Calibration Error from 0.294 to 0.001. Against the betting market, the calibrated model remained close but did not outperform BSP overall: model log loss was 0.282 versus 0.279 for BSP, and model AUC was 0.789 versus 0.796 for BSP. A value-betting rule using a 2% margin placed 16,737 bets and produced a marginal +0.74% ROI, but the 95% confidence interval for mean profit per bet was [-0.043, 0.058], so the result was not statistically significant. The main takeaway is methodological: chronological evaluation, probability calibration, and transparent benchmarking against a strong market baseline are what make the comparison credible, even when the model does not beat the market.
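The value-betting rule and its significance check can be sketched as below. This runs on synthetic inputs with flat 1-unit stakes and a normal-approximation confidence interval; the variable names (`p_model`, `p_market`) and the stake scheme are illustrative assumptions, not the project's code:

```python
import numpy as np

# Hypothetical inputs: calibrated model win probabilities, BSP-implied
# probabilities, and outcomes. Real evaluation uses the test-period runners.
rng = np.random.default_rng(1)
n = 20000
p_model = rng.uniform(0.02, 0.5, n)
p_market = rng.uniform(0.02, 0.5, n)   # BSP-implied probability
bsp = 1.0 / p_market                   # decimal odds
won = (rng.random(n) < p_market).astype(int)

# Back a runner only when the model's probability exceeds the market's
# implied probability by at least a 2% margin.
margin = 0.02
bets = p_model > p_market + margin

# Flat 1-unit stakes: profit is (bsp - 1) on a win, -1 otherwise.
profit = np.where(won == 1, bsp - 1.0, -1.0)[bets]

roi = profit.mean()
# Normal-approximation 95% CI for mean profit per bet; if it crosses
# zero, a positive ROI is not distinguishable from variance.
se = profit.std(ddof=1) / np.sqrt(len(profit))
ci_low, ci_high = roi - 1.96 * se, roi + 1.96 * se
```

This is exactly the shape of the reported result: a small positive mean profit per bet whose confidence interval straddles zero is consistent with no real edge.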