Calibration Metrics

How well do our predictions match reality?

Live Brier score, log loss, and AUC-ROC metrics computed from scored predictions across all monitored regions, updated automatically as new outcomes resolve.

Retrospective: computed on 480 backfill samples (Jan 2023 to Dec 2024). Forward-looking predictions are accumulating live data for future validation.
Brier Score: 0.106 (lower is better; 0 = perfect)
Log Loss: 0.360 (cross-entropy measure)
AUC-ROC: 0.95 (discrimination; 1.0 = perfect)
Resolved: 480 (scored predictions)

By Domain

Domain            Brier   Log Loss   AUC-ROC   N
Caucasus          0.097   0.354      0.81 *    48
Iran              0.224   0.640      0.53      48
Israel_Lebanon    0.125   0.388      0.83      48
Korean_Peninsula  0.039   0.219      N/A       48
Red_Sea           0.039   0.211      N/A       48
Sahel             0.018   0.144      N/A       48
South_China_Sea   0.268   0.731      0.45      48
Taiwan_Strait     0.032   0.193      N/A       48
Ukraine           0.014   0.126      N/A       48
Venezuela         0.204   0.596      0.71      48

N/A: AUC is undefined for single-class regions (all outcomes positive or all negative). These regions have strong Brier scores but no discrimination to measure.

* Asterisked AUC values are computed from fewer than 5 positive samples and should not be treated as reliable estimates.

Overall AUC (0.95) is computed on pooled data across all regions, so part of the measured discrimination comes from between-region differences in predicted probability (ranking high-risk regions above low-risk ones) rather than from within-region skill.
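The pooling effect can be illustrated with a toy example (the regions and numbers below are hypothetical, not taken from the table): two regions each issue a single constant probability, so neither has any within-region discrimination, yet pooling them yields an AUC well above 0.5 because the high-risk region's probabilities sit above the low-risk region's.

```python
def auc_roc(probs, outcomes):
    """AUC via pairwise comparison: the probability that a randomly chosen
    positive is ranked above a randomly chosen negative (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        return None  # undefined for single-class data
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical regions: constant forecasts, different base rates and levels.
region_a = ([0.8] * 4, [1, 1, 1, 0])  # high-risk region, mostly positive
region_b = ([0.1] * 4, [0, 0, 0, 1])  # low-risk region, mostly negative

auc_a = auc_roc(*region_a)            # 0.5: no within-region discrimination
auc_b = auc_roc(*region_b)            # 0.5: no within-region discrimination
pooled_p = region_a[0] + region_b[0]
pooled_y = region_a[1] + region_b[1]
auc_pooled = auc_roc(pooled_p, pooled_y)  # 0.75: inflated by pooling
```

This is why the per-domain AUC column is the more honest read of within-region discrimination, while the pooled 0.95 mixes in cross-region ranking.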

Understanding the Metrics

Brier Score

Measures the mean squared difference between predicted probabilities and actual outcomes. Ranges from 0 (perfect) to 1 (worst). A climatological baseline (always predicting the base rate) typically scores around 0.25. Scores below 0.1 indicate strong calibration.
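As a minimal sketch, the Brier score is just a mean squared error over binary outcomes (the example forecasts below are illustrative, not from the dataset):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities (0..1)
    and actual binary outcomes (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A confident, mostly-correct forecaster scores far below the ~0.25
# climatological baseline of always predicting a 50% base rate.
score = brier_score([0.9, 0.8, 0.1, 0.05], [1, 1, 0, 0])  # 0.015625
```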

Log Loss (Cross-Entropy)

Penalizes confident wrong predictions more heavily than the Brier score does: a prediction of 95% for an event that does not occur is punished severely. Lower values indicate more accurate probabilistic forecasts.
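The asymmetry is easy to see numerically; a sketch using only the standard library (example inputs are illustrative):

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the observed binary outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# A 95% forecast costs -ln(0.05) ~= 3.0 when the event does not occur,
# but only -ln(0.95) ~= 0.05 when it does.
miss = log_loss([0.95], [0])
hit = log_loss([0.95], [1])
```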

AUC-ROC

Area Under the Receiver Operating Characteristic curve. Measures the model's ability to discriminate between positive and negative outcomes regardless of the chosen threshold. 0.5 = random guessing, 1.0 = perfect discrimination.
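One common way to compute AUC without tracing the full ROC curve is the pairwise (Mann-Whitney) formulation, sketched below; it also makes clear why AUC is undefined when all outcomes belong to one class, as in the N/A regions above.

```python
def auc_roc(probs, outcomes):
    """AUC as the probability that a randomly chosen positive outcome
    receives a higher predicted probability than a randomly chosen
    negative outcome (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        return None  # single-class data: no positive/negative pairs to rank
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

perfect = auc_roc([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0])  # 1.0
random_ = auc_roc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])  # 0.5 (all ties)
```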

Resolved Predictions

The number of forecasts that have reached their resolution date and been scored against ground truth. Calibration metrics are only meaningful with a sufficient sample size (N > 30).