Isotonic Calibration: Turning Raw Model Scores into Trustworthy Probabilities
Why XGBoost and LightGBM probabilities are not really probabilities, how isotonic regression fixes calibration, Brier score, reliability diagrams, and why ARIA chose isotonic over Platt scaling.
When XGBoost's predict_proba returns 0.73, most users interpret it as "73% probability." This is wrong in a specific and important way: the 0.73 is the model's raw score (the summed tree outputs passed through a logistic function), and it does not generally correspond to a 73% empirical hit rate. The model is uncalibrated. A score of 0.73 might correspond to an actual 60% hit rate or an actual 85% hit rate; you do not know without checking.
For most ML applications, miscalibration is annoying but harmless. For financial applications, where the probability feeds directly into position sizing via the Kelly criterion or into risk management via VaR, miscalibration is dangerous. A position sized using "73% confidence" that is really 60% confident is over-leveraged by a meaningful margin.
Isotonic regression is the standard fix. This article explains how it works, why it is better than the simpler alternative (Platt scaling), and how to evaluate calibration honestly.
What "calibrated" actually means
A binary classifier is perfectly calibrated if, for every probability bin p, the empirical hit rate equals p. Concretely: take all predictions where the model output was between 0.70 and 0.75. If 73% of those predictions were correct, the model is well-calibrated at that bin. If 58% were correct, the model is overconfident in that bin.
Calibration is measured by binning predictions and computing the empirical hit rate per bin. A reliability diagram plots predicted probability (x-axis) against empirical hit rate (y-axis). A perfectly calibrated model produces a diagonal line. Miscalibrated models produce curves that sag below or above the diagonal.
XGBoost and LightGBM often produce reliability diagrams that bend away from the diagonal: overconfident in some score ranges and underconfident in others, with the exact shape depending on the objective, the regularization, and how hard the model was fit. This is a known tendency of tree-ensemble models and has nothing to do with whether the model has predictive power. A model with great AUC can still be terribly calibrated.
Why uncalibrated models hurt in finance
The Kelly criterion for sizing a long position is approximately f* = (p × b − q) / b, where p is the probability of winning, q = 1 − p is the probability of losing, and b is the payoff ratio. For even payoffs (b = 1), the Kelly fraction reduces to 2p − 1. So if p = 0.55, Kelly says bet 10% of bankroll. If p = 0.60, Kelly says bet 20%. A 5-percentage-point error in p therefore produces a 10-percentage-point error in position size: if the true p is 0.55 and the model reports 0.60, the position is 2x over-leveraged.
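The arithmetic above can be checked with a minimal sketch. The function name kelly_fraction and the even-payoff default b = 1 are illustrative assumptions, not part of any ARIA code.

```python
def kelly_fraction(p: float, b: float = 1.0) -> float:
    """Kelly fraction f* = (p * b - q) / b, with q = 1 - p."""
    q = 1.0 - p
    return (p * b - q) / b


# Position size implied by an uncalibrated score of 0.60 ...
believed = kelly_fraction(0.60)
# ... versus the size justified if the true hit rate is 0.55.
justified = kelly_fraction(0.55)

print(f"believed size:  {believed:.2f}")
print(f"justified size: {justified:.2f}")
print(f"over-leverage:  {believed / justified:.1f}x")
```

The ratio grows quickly as the true p approaches 0.5, because the justified size shrinks toward zero while the believed size does not.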
Uncalibrated models routinely produce 5-15 percentage point errors in p across the middle of the distribution. Plugging those uncalibrated values into Kelly produces consistently over-sized positions in confident-looking bets that are actually marginal. This is the most common form of "AI investing blew up my account": the model was correct in direction but wildly miscalibrated in confidence.
Isotonic regression: the method
Isotonic regression is a non-parametric method for fitting a monotonically non-decreasing function to data. In the calibration context, the function maps raw model scores to calibrated probabilities. The "monotonic" constraint encodes the assumption that a higher raw score should never correspond to a lower empirical hit rate, a reasonable property that any decent classifier already satisfies in expectation.
The algorithm, pool adjacent violators (PAV), is conceptually simple. Sort observations by raw score and treat each one as its own block whose value is its outcome (1 for hit, 0 for miss). Walk through the blocks in order. Whenever a block's average hit rate is lower than that of the block before it (violating monotonicity), pool the two into a single block with their combined average, then re-check the pooled block against its predecessor. Continue until the entire sequence is monotonic. The resulting step function is the calibrated probability map.
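A minimal pure-Python sketch of PAV may make the pooling step concrete. The names fit_pav and apply_pav are hypothetical, and the step-function lookup (no interpolation between blocks, clamping outside the fitted range) is one simple choice among several; library implementations may differ in these details.

```python
import bisect
from collections import defaultdict


def fit_pav(scores, labels):
    """Fit an isotonic step map with pool adjacent violators (PAV).

    Returns (upper_score, probability) pairs sorted by score,
    one pair per pooled block.
    """
    # Aggregate tied scores first so each distinct score starts as one block.
    grouped = defaultdict(lambda: [0.0, 0])
    for score, label in zip(scores, labels):
        grouped[score][0] += label
        grouped[score][1] += 1

    blocks = []  # each block: [sum_of_labels, count, highest_score]
    for score in sorted(grouped):
        hits, count = grouped[score]
        blocks.append([hits, count, score])
        # Pool backwards while the previous block's hit rate exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            hits, count, upper = blocks.pop()
            blocks[-1][0] += hits
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, hits / count) for hits, count, upper in blocks]


def apply_pav(step_map, score):
    """Look up the calibrated probability for a raw score (sorted-table lookup)."""
    uppers = [upper for upper, _ in step_map]
    index = min(bisect.bisect_left(uppers, score), len(step_map) - 1)
    return step_map[index][1]


# Toy data: the hit at 0.2 followed by misses at 0.3 and 0.4 violates monotonicity.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]
step_map = fit_pav(scores, labels)
for upper, prob in step_map:
    print(f"score <= {upper:.1f} -> {prob:.3f}")
```

In this toy run the observations at 0.2, 0.3, and 0.4 are pooled into one block with a hit rate of one third, which is the behavior the description above calls pooling the offending observations.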
Isotonic regression has two desirable properties. First, it makes no parametric assumptions: it does not assume the relationship between raw score and true probability is sigmoidal, logistic, or any specific shape. It just fits whatever monotonic function the data supports. Second, it is highly flexible: the resulting function can have many steps, capturing fine-grained miscalibration that simpler methods miss.
The alternative: Platt scaling
Platt scaling fits a parametric sigmoid to map raw scores to probabilities: p_calibrated = 1 / (1 + exp(−(a × raw + b))). The two parameters a and b are fit by maximum likelihood on a held-out calibration set.
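The two-parameter fit can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: fit_platt and sigmoid are hypothetical helper names, the optimizer is Newton's method on the log-loss, and the label smoothing used in Platt's original formulation is omitted for brevity.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(raw_scores, labels, iterations=25):
    """Fit a and b in p = sigmoid(a * raw + b) by Newton's method on log-loss."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        grad_a = grad_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(raw_scores, labels):
            p = sigmoid(a * x + b)
            weight = p * (1.0 - p)
            grad_a += (p - y) * x
            grad_b += p - y
            h_aa += weight * x * x
            h_ab += weight * x
            h_bb += weight
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:  # Hessian is singular (e.g. separable data); stop.
            break
        # Newton update: subtract inverse-Hessian times gradient.
        a -= (h_bb * grad_a - h_ab * grad_b) / det
        b -= (h_aa * grad_b - h_ab * grad_a) / det
    return a, b


# Toy held-out calibration set: raw scores with overlapping outcomes.
raw = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(raw, labels)
calibrated = [sigmoid(a * x + b) for x in raw]
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because only a and b are estimated, the fit is stable on small calibration sets, but it can only bend the scores along one sigmoid.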
Platt is simpler than isotonic and works well when the miscalibration is approximately sigmoid-shaped (often the case for SVMs and well-trained logistic regressions). It works poorly when miscalibration is more complex, for example when the model is overconfident at one end of the range and underconfident at the other.
For tree-ensemble models like XGBoost and LightGBM, miscalibration is generally non-sigmoidal, and isotonic outperforms Platt by a comfortable margin in terms of Brier score and log-loss. The trade-off is data efficiency: isotonic needs more calibration data to fit stably (typically at least 1,000 observations in the calibration set), while Platt works with as few as 100.
Why ARIA chose isotonic
ARIA's ML ensemble stacks LightGBM and XGBoost models. Both are tree-based and produce non-sigmoidal miscalibration. Empirically, on our walk-forward validation set, isotonic regression reduces Brier score by approximately 15% over raw scores and 8% over Platt scaling. The compute cost of isotonic is negligible at inference time (a lookup in a sorted table), and we have enough calibration data (10,000+ observations per regime × horizon bundle) to fit it stably.
The output is a probability that is directly usable as a Kelly input. When ARIA says "73% probability of outperforming the index over 3 months," the historical hit rate at that score level is approximately 73%, not 60%, not 85%. Position sizing based on this number is honest.
Brier score: how to measure calibration
Brier score is the mean squared error between predicted probabilities and actual outcomes (1 for hit, 0 for miss):
Brier = (1/N) Σ (p_i − y_i)²
Lower is better. A trivial baseline that always predicts the base rate has Brier roughly equal to p(1 − p), where p is the base rate. For a 50/50 base rate, the trivial baseline has Brier 0.25. A well-calibrated model with predictive power has Brier substantially below this.
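As a quick illustration, the score and its base-rate baseline can be computed in a few lines. The probabilities and outcomes below are made-up toy values, not ARIA results.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
model_probs = [0.8, 0.3, 0.7, 0.6, 0.4, 0.2, 0.9, 0.35]

# Trivial baseline: always predict the base rate (0.5 for this toy set).
base_rate = sum(outcomes) / len(outcomes)
baseline = brier_score([base_rate] * len(outcomes), outcomes)
model = brier_score(model_probs, outcomes)

print(f"baseline Brier: {baseline:.4f}")
print(f"model Brier:    {model:.4f}")
```

Comparing against the baseline computed from the same outcomes, rather than against a fixed 0.25, keeps the comparison fair when the base rate is not 50/50.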
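Brier decomposes into reliability minus resolution plus uncertainty. Reliability measures calibration error: the gap between predicted probabilities and the hit rates actually observed at those predictions (lower is better). Resolution measures how far those observed hit rates spread away from the base rate (higher is better). Uncertainty is the base-rate noise floor p(1 − p), which no model can change. Improving calibration shrinks the reliability term directly. Improving model power raises resolution. Brier captures both.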
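In the standard (Murphy) form the identity is Brier = reliability − resolution + uncertainty, and it holds exactly when predictions are grouped by their distinct values. The sketch below checks this on made-up data; murphy_decomposition is a hypothetical helper name.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)
    # Reliability: squared gap between each forecast and its observed hit rate.
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    # Resolution: squared gap between each observed hit rate and the base rate.
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Toy forecaster: says 0.1 (observed hit rate 0.2) and 0.9 (observed hit rate 0.6).
probs = [0.1] * 5 + [0.9] * 5
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

reliability, resolution, uncertainty = murphy_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

print(f"reliability={reliability:.3f} resolution={resolution:.3f} "
      f"uncertainty={uncertainty:.3f} brier={brier:.3f}")
```

Here the overconfident 0.9 forecasts dominate the reliability term; recalibrating them toward 0.6 would lower Brier without changing resolution.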
Reliability diagrams in practice
Reliability diagrams are the standard visual check. Plot predicted probability on the x-axis and empirical hit rate on the y-axis, binned (typically 10 bins of width 0.1). Add the diagonal y = x as a reference. A well-calibrated model produces points on the diagonal. A miscalibrated model produces points that systematically deviate.
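The numbers behind such a diagram can be produced with a short binning routine. This is a sketch with a hypothetical helper name (reliability_table) and toy inputs; the plotting step is left out, since any charting tool can draw the mean predicted probability against the hit rate per bin.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Return rows of (bin_low, bin_high, mean_predicted, hit_rate, count).

    Bins are equal-width over [0, 1]; empty bins are skipped.
    """
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum_of_probs, hits, count]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index][0] += p
        bins[index][1] += y
        bins[index][2] += 1
    rows = []
    for index, (prob_sum, hits, count) in enumerate(bins):
        if count:
            rows.append((index / n_bins, (index + 1) / n_bins,
                         prob_sum / count, hits / count, count))
    return rows


probs = [0.12, 0.15, 0.72, 0.74, 0.71, 0.73]
outcomes = [0, 0, 1, 1, 0, 1]
rows = reliability_table(probs, outcomes)
for low, high, mean_pred, hit_rate, count in rows:
    print(f"[{low:.1f}, {high:.1f}): predicted {mean_pred:.3f}, "
          f"observed {hit_rate:.3f}, n={count}")
```

Reporting the count per bin alongside the hit rate helps avoid over-reading deviations in sparsely populated bins.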
ARIA publishes reliability diagrams on the methodology page for every model bundle. The current Brier scores across our 9 regime × horizon bundles range from 0.18 (best, high-confidence regimes) to 0.24 (worst, transitional regimes). All are below the 0.25 trivial baseline for a 50/50 base rate, though the worst bundle clears it only narrowly.
Some practical lessons
- Always calibrate. Even a well-performing classifier produces uncalibrated probabilities by default. The calibration step is cheap and pays for itself in any application where probability magnitudes matter.
- Use a held-out calibration set. Do not fit the isotonic regression on the same data the base model was trained on; that produces over-confident calibration that does not generalize.
- Re-fit calibration regularly. Drift in the data distribution can break calibration even when the base model is still predictive. ARIA re-fits calibration monthly.
- Report calibration metrics, not just AUC. AUC measures discrimination (can the model order correctly) but says nothing about calibration. A model can have great AUC and terrible calibration.
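One simple way to honor the held-out rule with time-ordered data is a chronological three-way split: train the base model on the oldest segment, fit the calibrator on the next, and evaluate on the newest. The sketch below is an assumption for illustration; the function name and the 60/20/20 fractions are hypothetical and do not describe ARIA's walk-forward setup.

```python
def chronological_split(rows, train_frac=0.6, calib_frac=0.2):
    """Split time-ordered rows into train, calibration, and test segments."""
    n = len(rows)
    train_end = round(n * train_frac)
    calib_end = train_end + round(n * calib_frac)
    return rows[:train_end], rows[train_end:calib_end], rows[calib_end:]


# Stand-in for observations already sorted oldest to newest.
rows = list(range(100))
train, calib, test = chronological_split(rows)
print(len(train), len(calib), len(test))
```

Keeping the segments in time order means the calibrator never sees observations from the period it is later evaluated on.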
Conclusion
Isotonic calibration is the cheap, effective fix for the systematic miscalibration of tree-ensemble probabilities. For any application where probability magnitudes matter (position sizing, risk management, scenario analysis), it is not optional. ARIA applies it everywhere ML probabilities feed downstream decisions, and the resulting numbers are honest in a way that uncalibrated raw scores are not.
See calibrated ML probabilities on every Premium analysis in ARIA. Premium includes the full ML ensemble with confidence intervals. Start free to see basic agent scores, or read more about our scoring approach in "How AI Scores Stocks."
Frequently asked questions
When should I use Platt scaling instead of isotonic regression?
Use Platt scaling when your calibration set is small (< 500 observations) or when you have strong reason to believe the miscalibration is sigmoidal (typical for SVMs and well-regularized logistic regression). Use isotonic for everything else, especially tree-based ensembles like XGBoost, LightGBM, Random Forest, and gradient boosting variants. The data-efficiency penalty of isotonic disappears at 1,000+ calibration observations, which is easy to come by in most financial ML problems.
How do I check if my model is well-calibrated?
Three checks. First, compute Brier score and compare to the trivial baseline of p(1-p). Second, plot a reliability diagram with 10 bins and visually check that the points sit on the diagonal. Third, run a Hosmer-Lemeshow goodness-of-fit test for a formal statistical check. ARIA runs all three on every model bundle as part of the validation pipeline.
Does isotonic calibration reduce model accuracy?
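Not in any way that matters for probability outputs. On fresh data it typically improves probability-based metrics like Brier score and log-loss, provided the calibration set is large enough that the step function does not overfit. Because the map is monotonic, it never reverses the ordering of two predictions, so AUC and other rank-based metrics are essentially unchanged; the only effect is that scores pooled into the same step become ties. Metrics evaluated at a fixed probability cutoff (accuracy, precision, or recall at 0.5, say) can change, because calibration moves probabilities relative to that cutoff. Calibration is an upgrade for probability-output applications and close to a no-op for rank-output applications.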