Horse Racing Prediction Model
Probabilistic prediction and value betting strategy evaluation using logistic regression on 777K+ Betfair runners
Test AUC: 0.789
Model / BSP Log Loss: 0.282 / 0.279
ROI @ 2% Margin: +0.74% (95% CI crosses 0)
The Problem
The core task was to estimate each runner's pre-race win probability from public race information and compare those forecasts directly with Betfair Starting Price (BSP). That is a difficult benchmark: BSP is produced in a highly efficient exchange market, and the dataset has a base win rate of only ~10.5% across 777,549 runner-level observations. The modeling problem is therefore both heavily imbalanced and high-dimensional, spanning horse ability, jockey and trainer effects, race conditions, and pre-race market signals.
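To make the model-versus-market comparison fair, both sets of forecasts are expressed as probabilities that sum to one within each race. A minimal sketch of how BSP decimal odds convert to normalized implied probabilities, using a hypothetical three-runner race (the column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical race with three runners and their Betfair Starting Prices
# (decimal odds).
df = pd.DataFrame({
    "race_id": [1, 1, 1],
    "bsp": [2.5, 4.0, 6.0],
})

# Raw implied probability is 1 / decimal odds; summed across a race it
# exceeds 1 because of the market overround.
df["implied_raw"] = 1.0 / df["bsp"]

# Normalize within each race so the probabilities sum to exactly 1.
df["implied_prob"] = (
    df["implied_raw"] / df.groupby("race_id")["implied_raw"].transform("sum")
)
```

The same within-race normalization is applied to the model's calibrated probabilities before computing log loss against the market.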
Approach
The dataset contains 30 columns of race metadata, runner attributes, and market information for 777,549 runners. Feature engineering combined log-transformed pre-race odds (`log_ltp_5min`), a repaired `official_rating` for horse ability, chronological pre-race `jockey_rating` and `trainer_rating` features, and one-hot encoding of high-cardinality runner IDs and race metadata, producing more than 64,000 sparse features after preprocessing.

The data was split chronologically into 70/15/15 train, validation, and test periods. The main model was an L2-regularized logistic regression trained in a pipeline with `RandomUnderSampler` to address class imbalance, with `GridSearchCV` selecting `C=0.01` as the best regularization strength. A Random Forest was trained as a benchmark, isotonic calibration was fit on the held-out validation set, and both the calibrated model probabilities and BSP-implied probabilities were normalized within each race before market comparison and value-betting evaluation.
Results
The logistic regression beat the Random Forest on the test set, with AUC 0.789 versus 0.742, and the dominant feature in both models was log-transformed pre-race market odds. Isotonic calibration materially improved probability quality, reducing the Brier score from 0.195 to 0.082 and Expected Calibration Error from 0.294 to 0.001. Against the betting market, the calibrated model remained close but did not outperform BSP overall: model log loss was 0.282 versus 0.279 for BSP, and model AUC was 0.789 versus 0.796 for BSP. A value-betting rule using a 2% margin placed 16,737 bets and produced a marginal +0.74% ROI, but the 95% confidence interval for mean profit per bet was [-0.043, 0.058], so the result was not statistically significant. The main takeaway is methodological: chronological evaluation, probability calibration, and transparent benchmarking against a strong market baseline are what make the comparison credible, even when the model does not beat the market.
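The value-betting rule and its significance check can be sketched as below. This runs on synthetic inputs with flat 1-unit stakes and a normal-approximation confidence interval; the variable names (`p_model`, `p_market`) and the stake scheme are illustrative assumptions, not the project's code:

```python
import numpy as np

# Hypothetical inputs: calibrated model win probabilities, BSP-implied
# probabilities, and outcomes. Real evaluation uses the test-period runners.
rng = np.random.default_rng(1)
n = 20000
p_model = rng.uniform(0.02, 0.5, n)
p_market = rng.uniform(0.02, 0.5, n)   # BSP-implied probability
bsp = 1.0 / p_market                   # decimal odds
won = (rng.random(n) < p_market).astype(int)

# Back a runner only when the model's probability exceeds the market's
# implied probability by at least a 2% margin.
margin = 0.02
bets = p_model > p_market + margin

# Flat 1-unit stakes: profit is (bsp - 1) on a win, -1 otherwise.
profit = np.where(won == 1, bsp - 1.0, -1.0)[bets]

roi = profit.mean()
# Normal-approximation 95% CI for mean profit per bet; if it crosses
# zero, a positive ROI is not distinguishable from variance.
se = profit.std(ddof=1) / np.sqrt(len(profit))
ci_low, ci_high = roi - 1.96 * se, roi + 1.96 * se
```

This is exactly the shape of the reported result: a small positive mean profit per bet whose confidence interval straddles zero is consistent with no real edge.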