Akshat Vasisht

Quantile Regression for Weather Risk Pricing

Machine Learning · XGBoost · Quantile Regression

I built the ML components for a weather prediction market (CS 506 team project): models that generate risk-adjusted odds for temperature, wind speed, and precipitation outcomes.

Ensembles require uncorrelated errors

I started with an averaging ensemble of XGBoost, RandomForest, and LightGBM. The first implementation hit a feature count mismatch at inference (82 vs 73 features) because feature lists were defined locally in training scripts rather than serialized with the models.

After centralizing feature definitions, the ensemble achieved 5.00°F MAE -- identical to XGBoost alone. XGBoost and LightGBM are both gradient boosting variants, and all three models are tree ensembles trained on the same features. They fail on the same inputs, so averaging adds nothing.
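A quick way to diagnose this failure mode is to correlate the base models' validation residuals: when residuals are strongly correlated, averaging can't cancel errors. A minimal sketch with synthetic data and sklearn stand-ins (the real pipeline's features and model configs differ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather features (illustrative only)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "gbt": GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr),
    "rf": RandomForestRegressor(random_state=0).fit(X_tr, y_tr),
}
residuals = {name: y_val - m.predict(X_val) for name, m in models.items()}

# Pearson correlation of residuals: near 1.0 means averaging can't help
corr = np.corrcoef(residuals["gbt"], residuals["rf"])[0, 1]
print(f"residual correlation: {corr:.2f}")
```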

Stacking with a Ridge meta-model confirmed this: weights of 0.55 (XGBoost), 0.42 (LightGBM), 0.03 (RandomForest). Introducing algorithmic diversity (Ridge, SVR, MLP alongside XGBoost) didn't help either; the meta-model assigned 80% weight to XGBoost. A single tuned model outperformed every ensemble configuration.
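The stacking setup can be sketched with sklearn's StackingRegressor, where the Ridge meta-model's fitted coefficients are the base-model weights described above. This is a toy reconstruction on synthetic data, not the project's actual configuration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("gbt", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),  # meta-model: its coef_ are the base-model weights
    cv=3,                     # out-of-fold predictions feed the meta-model
)
stack.fit(X, y)
print(dict(zip(["gbt", "rf"], stack.final_estimator_.coef_.round(2))))
```

Inspecting `final_estimator_.coef_` after fitting is how a near-zero weight (like RandomForest's 0.03) shows up: the meta-model has learned that one base model adds nothing on top of the others.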

Point predictions to distributions

5.00°F MAE is a reasonable point forecast, but a point prediction can't set odds. A forecast of 77°F doesn't tell the platform whether that's a confident prediction or a volatile one. Pricing wagers requires the model to output a distribution, not a number.

XGBoost supports quantile regression natively via reg:quantileerror. I trained three models (P10, P50, P90) per target:

import xgboost as xgb

base_model = xgb.XGBRegressor(
    objective='reg:quantileerror',  # pinball (quantile) loss
    quantile_alpha=quantile,        # 0.1, 0.5, or 0.9
    random_state=RANDOM_SEED,
    n_jobs=-1,
    tree_method='hist'              # histogram-based tree construction
)

The P10/P90 outputs define an 80% prediction interval. The spread (P90 - P10) is the risk signal: 4°F means tight odds, 9°F means the house widens its margin.
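One way to turn that spread into pricing is to scale the house margin with interval width. The rule below is a hypothetical illustration (the thresholds and the linear interpolation are my assumptions, not the project's actual pricing constants):

```python
def house_margin(p10: float, p90: float,
                 base_margin: float = 0.10,
                 tight: float = 4.0, wide: float = 9.0) -> float:
    """Hypothetical pricing rule: widen the house margin as the
    P90 - P10 spread grows. Thresholds are illustrative."""
    spread = p90 - p10
    # Interpolate from base margin (tight interval) to double the
    # base margin (wide interval), clamped to that range.
    t = min(max((spread - tight) / (wide - tight), 0.0), 1.0)
    return base_margin * (1.0 + t)

print(house_margin(74.0, 78.0))  # 4°F spread: tight odds, base margin
print(house_margin(70.0, 80.0))  # 10°F spread: margin widened
```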

Feature engineering produced 90 candidates -- cyclical temporal encodings, rolling windows (3/7/14/30-day), lag features, meteorological interactions. RFECV reduced that to 11-16 per target. Hyperparameter tuning used Optuna with TimeSeriesSplit to avoid temporal leakage. The pipeline serializes twelve artifacts: one feature selector and three quantile models per target.
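The selection step can be sketched as RFECV driven by TimeSeriesSplit, which keeps each validation fold strictly after its training fold so feature selection never sees future rows. A minimal version on synthetic data (estimator choice and fold counts are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

selector = RFECV(
    estimator=GradientBoostingRegressor(random_state=0),
    cv=TimeSeriesSplit(n_splits=3),      # folds respect temporal order
    scoring="neg_mean_absolute_error",
    min_features_to_select=5,
    step=2,                              # drop two features per iteration
)
selector.fit(X, y)
print(f"selected {selector.n_features_} of {X.shape[1]} features")
```

The same TimeSeriesSplit object can be reused inside the Optuna objective so tuning and selection share one leakage-free validation scheme.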

Final MAE: 4.35°F on a held-out test set, down from 5.00°F.

Calibration

On the held-out set, P10-P90 intervals captured actuals 81-85% of the time across all three targets (nominal: 80%). Backtesting confirmed the house margin converges to the 10% target.
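The coverage check itself is one line of NumPy: count how often the actual lands inside the [P10, P90] band. A self-contained sketch with synthetic data (not the project's test set), where the true quantiles of a known distribution should yield roughly 80% coverage:

```python
import numpy as np

def interval_coverage(y_true, p10, p90):
    """Fraction of actuals falling inside the [P10, P90] band.
    A calibrated 80% interval should land near 0.80."""
    y_true, p10, p90 = map(np.asarray, (y_true, p10, p90))
    return float(np.mean((y_true >= p10) & (y_true <= p90)))

# Toy check: true 10th/90th percentiles of N(70, 5) cover ~80% by construction
rng = np.random.default_rng(0)
y = rng.normal(70.0, 5.0, size=10_000)
p10 = np.full_like(y, 70.0 - 1.2816 * 5.0)
p90 = np.full_like(y, 70.0 + 1.2816 * 5.0)
print(f"coverage: {interval_coverage(y, p10, p90):.3f}")
```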

The intervals also track seasonal volatility: April averages 21°F wide, July averages 10°F.

[Figure: P90-P10 interval width by month for high temperature, showing wider intervals in spring and narrower intervals in summer]

I later ported this pipeline to CardinalCast, a standalone Python/FastAPI implementation.
