Calibration Metrics

How well do our predictions match reality?

Live Brier score, log loss, and AUC-ROC metrics computed from scored predictions across all monitored regions, updated automatically as new outcomes resolve.

Retrospective: computed on 480 backfill samples (Jan 2023 to Dec 2024). Forward-looking predictions are accumulating live data for future validation.
Brier Score: 0.106 (lower is better; 0 = perfect)
Log Loss: 0.360 (cross-entropy measure)
AUC-ROC: 0.95 (discrimination; 1.0 = perfect)
Resolved: 480 (scored predictions)

By Domain

Domain            Brier   Log Loss   AUC-ROC   N
Caucasus          0.097   0.354      0.81 *    48
Iran              0.224   0.640      0.53      48
Israel_Lebanon    0.125   0.388      0.83      48
Korean_Peninsula  0.039   0.219      N/A       48
Red_Sea           0.039   0.211      N/A       48
Sahel             0.018   0.144      N/A       48
South_China_Sea   0.268   0.731      0.45      48
Taiwan_Strait     0.032   0.193      N/A       48
Ukraine           0.014   0.126      N/A       48
Venezuela         0.204   0.596      0.71      48

N/A: AUC is undefined for single-class regions (all outcomes positive or all negative). These regions have strong Brier scores but no discrimination to measure.

* Asterisked AUC values are computed from fewer than 5 positive samples and should not be treated as reliable estimates.

Overall AUC (0.95) is computed on pooled data across all regions, so part of the measured discrimination comes from between-region differences in predicted probability (ranking high-risk regions above low-risk ones) rather than from within-region skill.
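The pooling effect can be illustrated with a toy example (the regions and numbers below are hypothetical, not taken from the table): two regions each issue a single constant probability, so neither has any within-region discrimination, yet pooling them yields an AUC well above 0.5 because the high-risk region's probabilities sit above the low-risk region's.

```python
def auc_roc(probs, outcomes):
    """AUC via pairwise comparison: the probability that a randomly chosen
    positive is ranked above a randomly chosen negative (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        return None  # undefined for single-class data
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical regions: constant forecasts, different base rates and levels.
region_a = ([0.8] * 4, [1, 1, 1, 0])  # high-risk region, mostly positive
region_b = ([0.1] * 4, [0, 0, 0, 1])  # low-risk region, mostly negative

auc_a = auc_roc(*region_a)            # 0.5: no within-region discrimination
auc_b = auc_roc(*region_b)            # 0.5: no within-region discrimination
pooled_p = region_a[0] + region_b[0]
pooled_y = region_a[1] + region_b[1]
auc_pooled = auc_roc(pooled_p, pooled_y)  # 0.75: inflated by pooling
```

This is why the per-domain AUC column is the more honest read of within-region discrimination, while the pooled 0.95 mixes in cross-region ranking.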

Understanding the Metrics

Brier Score

Measures the mean squared difference between predicted probabilities and actual outcomes. Ranges from 0 (perfect) to 1 (worst). A climatological baseline (always predicting the base rate) typically scores around 0.25. Scores below 0.1 indicate strong calibration.
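As a minimal sketch, the Brier score is just a mean squared error over binary outcomes (the example forecasts below are illustrative, not from the dataset):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities (0..1)
    and actual binary outcomes (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A confident, mostly-correct forecaster scores far below the ~0.25
# climatological baseline of always predicting a 50% base rate.
score = brier_score([0.9, 0.8, 0.1, 0.05], [1, 1, 0, 0])  # 0.015625
```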

Log Loss (Cross-Entropy)

Penalizes confident wrong predictions more heavily than the Brier score does: a prediction of 95% for an event that does not occur is punished severely. Lower values indicate more accurate probabilistic forecasts.
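The asymmetry is easy to see numerically; a sketch using only the standard library (example inputs are illustrative):

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the observed binary outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# A 95% forecast costs -ln(0.05) ~= 3.0 when the event does not occur,
# but only -ln(0.95) ~= 0.05 when it does.
miss = log_loss([0.95], [0])
hit = log_loss([0.95], [1])
```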

AUC-ROC

Area Under the Receiver Operating Characteristic curve. Measures the model's ability to discriminate between positive and negative outcomes regardless of the chosen threshold. 0.5 = random guessing, 1.0 = perfect discrimination.
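One common way to compute AUC without tracing the full ROC curve is the pairwise (Mann-Whitney) formulation, sketched below; it also makes clear why AUC is undefined when all outcomes belong to one class, as in the N/A regions above.

```python
def auc_roc(probs, outcomes):
    """AUC as the probability that a randomly chosen positive outcome
    receives a higher predicted probability than a randomly chosen
    negative outcome (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        return None  # single-class data: no positive/negative pairs to rank
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

perfect = auc_roc([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0])  # 1.0
random_ = auc_roc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])  # 0.5 (all ties)
```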

Resolved Predictions

The number of forecasts that have reached their resolution date and been scored against ground truth. Calibration metrics are only meaningful with a sufficient sample size (N > 30).