XGBoost in Esports: How BritBets Predicts Dota 2 Drafts with Machine Learning
Technical deep-dive into BritBets' ML pipeline: 624 XGBoost features, 91 DraftNet expert features, isotonic calibration, temperature scaling, SHAP analysis, dampening stack, and market gap detection.
Why Draft Prediction Is a Well-Posed ML Problem
Dota 2's Captains Mode draft produces a fixed-size, structured input (10 hero picks + up to 14 bans across two teams) with a binary outcome (Radiant win or Dire win). There is no partial credit. The input space is discrete and combinatorially large — roughly 124-choose-10 with ordering constraints — but the outcome is heavily influenced by compositional interactions rather than individual hero strength alone. This makes it an ideal problem for gradient-boosted trees (which excel at tabular feature interactions) and attention-based neural networks (which learn pairwise relationships from data).
BritBets runs two models in production: an XGBoost classifier and a custom PyTorch neural network called DraftNet v4. Both models consume the same match context but operate on different feature representations. The XGBoost model treats the draft as a flat feature vector; DraftNet treats it as a structured sequence with per-hero embeddings. This article walks through the full pipeline from raw data to calibrated predictions, including the dampening stack that prevents overconfident outputs from reaching users.
XGBoost Feature Engineering: 624 Dimensions
The XGBoost model operates on a 624-dimensional feature vector per match. The features decompose into several groups. The first group is hero one-hot encodings: each hero gets a binary indicator for Radiant-pick and Dire-pick, producing 2 x ~124 features. These encode the raw draft composition without any domain knowledge.
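The one-hot block can be sketched in a few lines. The 124-hero count and the zero-based hero-ID mapping here are illustrative assumptions, not BritBets' actual constants:

```python
# Minimal sketch of the hero one-hot feature group: one Radiant indicator
# and one Dire indicator per hero. N_HEROES is an assumption.
N_HEROES = 124

def draft_onehot(radiant_ids, dire_ids, n_heroes=N_HEROES):
    """Return 2 * n_heroes binary indicators: Radiant picks, then Dire picks."""
    vec = [0] * (2 * n_heroes)
    for hid in radiant_ids:
        vec[hid] = 1                # Radiant block: indices [0, n_heroes)
    for hid in dire_ids:
        vec[n_heroes + hid] = 1     # Dire block: indices [n_heroes, 2*n_heroes)
    return vec

vec = draft_onehot([0, 5, 17, 42, 99], [3, 8, 21, 60, 110])
assert len(vec) == 2 * N_HEROES and sum(vec) == 10
```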
The second group is hero-level statistics: rolling win rates, pick rates, and ban rates for each selected hero, computed over the trailing 30-day window from the training data cutoff. These features let the model learn meta-dependent hero strength — a hero that is strong in patch 7.41 may have been weak in 7.38.
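A leakage-safe version of the trailing-window statistic might look like the following. The match-tuple layout and the neutral 0.5 prior for heroes with no recent games are assumptions:

```python
# Sketch of a trailing-30-day hero win rate, computed strictly from matches
# before the cutoff so no future information leaks into the feature.
from datetime import date, timedelta

def rolling_winrate(matches, hero_id, cutoff, window_days=30, prior=0.5):
    """matches: iterable of (match_date, hero_id, won) tuples."""
    start = cutoff - timedelta(days=window_days)
    games = [won for d, h, won in matches
             if h == hero_id and start <= d < cutoff]
    if not games:
        return prior          # neutral fallback when the hero is unseen
    return sum(games) / len(games)

matches = [
    (date(2026, 3, 1), 7, True),
    (date(2026, 3, 10), 7, False),
    (date(2026, 3, 20), 7, True),
    (date(2025, 12, 1), 7, False),   # outside the 30-day window, ignored
]
rate = rolling_winrate(matches, 7, date(2026, 3, 25))  # 2 wins in 3 games
```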
The third group is team-level context: historical head-to-head record between the two teams (last 20 meetings), team Elo rating derived from recent match results, and league tier indicators (tier-1, tier-2, tier-3, or unknown). League tier is critical because tier-3 matches have significantly noisier outcomes and the model needs to know when to be less confident.
The fourth and most nuanced group is 56 enrichment features derived from replay analysis. These include average gold and XP differentials at 10 and 20 minutes, first blood timing, tower damage distribution, teamfight participation rates, ward placement density, Roshan attempt timing, and buyback usage frequency. When replay data is unavailable (which is about 73% of the time in production), these features are filled with league-average priors instead of zeros — a design choice that reduced production accuracy degradation by approximately 2 percentage points compared to zero-fill.
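The prior-fill behaviour can be sketched as below. The feature names and tier averages are invented placeholders, not values from the production system:

```python
# Sketch of prior-fill for replay enrichment features: when a match has no
# replay data, substitute the league-tier average rather than zero.
LEAGUE_PRIORS = {
    "tier1": {"gold_diff_10": 310.0, "first_blood_min": 2.1},
    "tier2": {"gold_diff_10": 270.0, "first_blood_min": 2.6},
}

def fill_enrichment(features, tier, feature_names):
    """features: dict of available replay stats (may be empty)."""
    priors = LEAGUE_PRIORS.get(tier, LEAGUE_PRIORS["tier2"])
    return [features.get(name, priors[name]) for name in feature_names]

# Replay missing entirely: every slot falls back to the tier prior.
row = fill_enrichment({}, "tier1", ["gold_diff_10", "first_blood_min"])
```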
DraftNet v4 Architecture: Attention Over Heroes
DraftNet v4 is a 304K-parameter PyTorch model trained on approximately 28,000 professional matches. The architecture has three stages: per-hero encoding, team-level attention, and global classification.
In the per-hero encoding stage, each hero is represented by a 48-dimensional learned embedding concatenated with a 27-dimensional context vector. The context vector includes ability type indicators (stun, silence, root, slow, BKB-piercing), position assignments (1-5), hero win rate in the current patch, meta trend slope (is the hero rising or falling in pick rate), and several binary flags for special properties like global presence or Aegis synergy.
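The per-hero stage reduces to an embedding lookup plus a concatenation. The dimensions match the article; the class and layer names are assumptions:

```python
# Sketch of the per-hero encoding stage: a learned 48-d embedding
# concatenated with a 27-d context vector, giving 75 dims per hero.
import torch
import torch.nn as nn

class HeroEncoder(nn.Module):
    def __init__(self, n_heroes=124, emb_dim=48):
        super().__init__()
        self.embedding = nn.Embedding(n_heroes, emb_dim)

    def forward(self, hero_ids, context):
        # hero_ids: (batch, 10) ints; context: (batch, 10, 27) floats
        emb = self.embedding(hero_ids)              # (batch, 10, 48)
        return torch.cat([emb, context], dim=-1)    # (batch, 10, 75)

enc = HeroEncoder()
out = enc(torch.randint(0, 124, (2, 10)), torch.randn(2, 10, 27))
assert out.shape == (2, 10, 75)
```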
The team-level stage applies two layers of self-attention within each team's five heroes, followed by a cross-team attention layer. Self-attention lets the model discover synergies (for example, Magnus and Ember Spirit enable each other's combos), while cross-attention captures counter-pick dynamics (for example, Spirit Breaker's ability to punish squishy splitpush heroes). A pairwise interaction matrix between all 10 heroes is flattened and concatenated with the attention outputs.
The global classification stage concatenates the team representations with 91 expert features, 59 meta-context features, 20 synergy scores, 19 playstyle indicators, ban information, league tier, and patch version. This combined vector passes through two fully connected layers with ReLU activation, dropout, and batch normalization before producing a win probability via sigmoid. A secondary head predicts expected game duration (multi-task learning), which acts as a regularizer and improves win-prediction calibration by approximately 0.5 percentage points.
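Stages two and three can be sketched with `nn.MultiheadAttention`. The hidden sizes, head count, shared attention weights, and the omission of the pairwise interaction matrix and the ban/league/patch inputs are simplifications, not DraftNet's real configuration:

```python
import torch
import torch.nn as nn

class DraftNetSketch(nn.Module):
    """Illustrative skeleton: team self-attention, cross-team attention,
    then a shared trunk feeding a win head and a duration head."""
    def __init__(self, hero_dim=75, extra_dim=91 + 59 + 20 + 19, hidden=128):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hero_dim, 5, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hero_dim, 5, batch_first=True)
        self.trunk = nn.Sequential(
            nn.Linear(2 * hero_dim + extra_dim, hidden),
            nn.ReLU(), nn.Dropout(0.2), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.win_head = nn.Linear(hidden, 1)        # win probability (sigmoid)
        self.duration_head = nn.Linear(hidden, 1)   # multi-task regularizer

    def forward(self, radiant, dire, extra):
        # radiant, dire: (batch, 5, hero_dim) per-hero encodings
        r, _ = self.self_attn(radiant, radiant, radiant)  # intra-team synergy
        d, _ = self.self_attn(dire, dire, dire)
        r, _ = self.cross_attn(r, d, d)                   # counter-pick dynamics
        x = torch.cat([r.mean(1), d.mean(1), extra], dim=-1)
        h = self.trunk(x)
        return torch.sigmoid(self.win_head(h)), self.duration_head(h)

net = DraftNetSketch()
p, dur = net(torch.randn(4, 5, 75), torch.randn(4, 5, 75), torch.randn(4, 189))
assert p.shape == (4, 1)
```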
The 91 Expert Features: Encoding Dota Knowledge
The expert features are computed by a module called dota_expert.py, which encodes domain knowledge that would be difficult to learn from match outcomes alone. Each feature is a scalar between 0 and 1 representing a team-level property of the draft. Key feature groups include:
Strategy identification covers teamfight strength, push potential, pickoff capability, split-push pressure, and late-game scaling. Power combo detection enumerates known synergies such as Void + Skywrath, Dark Seer + Sven, and IO + Gyrocopter. Vulnerability exploitation checks whether one team's heroes are structurally weak against the other's (for example, heavy reliance on spell immunity against a draft with multiple BKB-piercing disables). Further groups score lane dominance per lane, crowd control chain length and reliability, damage type balance (physical vs. magical vs. pure, burst vs. sustained), mobility and gap-closing capability, global hero presence, high-ground defense strength, area-of-effect coverage, backline access probability, and hero flexibility (how many viable position assignments each hero has). The remaining features cover dispel availability, aura stacking value, Tormentor efficiency, kill security (the ability to finish low-HP targets), recovery potential (comeback mechanics like Alchemist or Medusa), mana warfare capability, teamfight reset potential, frontline durability, vision control tools, and Roshan fight strength.
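One such feature can be sketched as follows. The per-hero stun durations and the normalization cap are invented placeholders, not values from dota_expert.py:

```python
# Sketch of one deterministic expert feature: crowd control chain length,
# normalized into [0, 1]. The duration table is illustrative only.
STUN_DURATION = {"Magnus": 2.75, "Sven": 2.0, "Lion": 1.7, "Faceless Void": 3.0}

def cc_chain_score(team, cap=8.0):
    """Sum reliable stun durations across the draft, capped and normalized."""
    total = sum(STUN_DURATION.get(hero, 0.0) for hero in team)
    return min(total / cap, 1.0)

score = cc_chain_score(["Magnus", "Sven", "Lion", "Io", "Gyrocopter"])
assert 0.0 <= score <= 1.0
```

Because the feature depends only on the draft, the same input always yields the same output, which is the determinism property described above.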
These features are deterministic — given a draft, the output is always the same. They do not require replay data, making them available for every prediction regardless of data availability. The feature importance analysis (via SHAP) shows that the expert features collectively account for approximately 30% of DraftNet's prediction variance, making them the single most important feature group after the hero embeddings themselves.
Training Pipeline and Validation
XGBoost is trained using the xgboost package's scikit-learn-compatible XGBClassifier with 500 estimators, max depth 6, learning rate 0.05, and column subsampling of 0.8. The model is trained on matches from the past 18 months with a time-based train/test split (most recent 2 months held out for testing). Test accuracy is approximately 77%.
DraftNet is trained with Adam optimizer (lr=3e-4), batch size 256, for 100 epochs with early stopping on validation loss (patience 10). Training runs on Railway CPU and takes approximately 3-4 hours. The model auto-saves checkpoints and the best model is pushed to GitHub. Validation accuracy is 68.8%. The lower accuracy compared to XGBoost is expected — DraftNet operates on a smaller feature set (no replay enrichment) but generalizes better to novel hero combinations not seen in training.
Both models are retrained periodically as new match data accumulates and after major patches. The N_ENRICHMENT constant must match between the feature extraction code and the trained model checkpoint — a mismatch here will cause silent prediction errors, which is one of the more dangerous failure modes in the system.
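One cheap defence against that failure mode is to store the feature count in checkpoint metadata and fail loudly at load time. The metadata layout here is an assumption:

```python
# Sketch of a guard against the N_ENRICHMENT mismatch: refuse to load a
# checkpoint whose recorded feature count differs from the extraction code.
N_ENRICHMENT = 56   # must match the value used at training time

def check_checkpoint(metadata, expected=N_ENRICHMENT):
    saved = metadata.get("n_enrichment")
    if saved != expected:
        raise ValueError(
            f"checkpoint trained with n_enrichment={saved}, "
            f"but feature extraction produces {expected}"
        )

check_checkpoint({"n_enrichment": 56})        # matches: loads silently
try:
    check_checkpoint({"n_enrichment": 48})    # stale checkpoint: loud failure
except ValueError as e:
    print(e)
```

Turning a silent prediction error into an exception at startup is almost always the right trade.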
Calibration Stack: Isotonic + Temperature Scaling
Raw model outputs are poorly calibrated — a common problem with both XGBoost and neural networks trained on binary classification. BritBets applies a two-stage calibration pipeline.
Stage 1 is isotonic calibration, fitted on a held-out calibration set. Isotonic regression learns a monotone piecewise-linear mapping from raw probabilities to observed win frequencies. This corrects systematic biases like the tendency of XGBoost to cluster predictions near 0.55 and 0.45 rather than spreading them across the full probability range.
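Stage 1 can be sketched with scikit-learn's IsotonicRegression. The toy calibration data below is a placeholder for the real held-out set:

```python
# Sketch of isotonic calibration: fit a monotone map from raw model scores
# to observed outcomes on a held-out calibration set.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
raw = rng.uniform(0.3, 0.7, size=2000)              # mid-range-clustered scores
outcome = (rng.uniform(size=2000) < raw).astype(float)

iso = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
iso.fit(raw, outcome)

calibrated = iso.predict([0.45, 0.55, 0.65])
assert np.all(np.diff(calibrated) >= 0)             # monotone by construction
```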
Stage 2 is temperature scaling with T=1.8. The calibrated logit is divided by the temperature before applying sigmoid, which compresses extreme predictions toward 0.5. A raw prediction of 0.80 becomes approximately 0.68 after temperature scaling. This is intentionally aggressive — the system prioritizes reliability over sharpness, because overconfident predictions lead to poor betting decisions. The temperature was selected by optimizing expected calibration error (ECE) on a validation set, not by tuning for log-loss or accuracy.
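Temperature scaling is a two-line transform, shown here as a standalone sketch:

```python
# Sketch of temperature scaling (T = 1.8): divide the logit by T before
# the sigmoid, pulling predictions toward 0.5.
import math

def temperature_scale(p, T=1.8):
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / T))

print(round(temperature_scale(0.80), 2))   # → 0.68
```

Note that 0.5 is a fixed point (its logit is 0), so only confident predictions are compressed.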
The Dampening Stack: Six Modifiers
After calibration, predictions pass through a dampening stack of six modifiers. Each modifier is logged per prediction for auditability. The modifiers are:
1. Calibration adjustment: the isotonic and temperature stages described above.
2. Head-to-head correction: if two teams have played 5+ recent matches, the prediction is nudged toward the historical win rate, weighted by recency.
3. Variance penalty: for matchups with high prediction variance across bootstrap samples, the prediction is compressed toward 50%.
4. Stand-in penalty: if a team is known to be fielding a substitute player, the prediction edge is reduced by a configurable amount.
5. Market gap dampening: when the AI's prediction disagrees with bookmaker odds by more than 10%, the edge is compressed. For gaps of 10-20%, compression is 50%; for gaps above 20%, compression is 80%. This reflects the empirical finding that extreme AI-market disagreement is more often explained by the AI missing information than by the AI finding alpha.
6. League tier dampening: unknown or tier-3 leagues receive a flat 20% edge compression.
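Modifier 5 is the most mechanical of the six and can be sketched directly from the stated thresholds:

```python
# Sketch of market gap dampening: compress the AI-vs-market edge by 50%
# for 10-20% gaps and by 80% for gaps above 20%.
def dampen_market_gap(ai_prob, market_prob):
    edge = ai_prob - market_prob
    gap = abs(edge)
    if gap > 0.20:
        edge *= 0.20          # 80% compression for extreme disagreement
    elif gap > 0.10:
        edge *= 0.50          # 50% compression for moderate disagreement
    return market_prob + edge

# 15-point gap: half the edge survives -> 0.60 + 0.075 = 0.675
assert abs(dampen_market_gap(0.75, 0.60) - 0.675) < 1e-9
```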
The dampening stack reduced the false positive rate for value bets by approximately 40% in backtesting, at the cost of flagging fewer true value bets. This tradeoff is acceptable because a single bad value bet recommendation can erode user trust far more than a missed opportunity.
SHAP Analysis: Per-Feature Attribution
Every prediction is accompanied by a SHAP breakdown computed using the TreeExplainer for XGBoost. SHAP values decompose the prediction into additive contributions from each feature, satisfying the property that the sum of all SHAP values plus the base rate equals the predicted probability.
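The additivity property is easiest to see on a toy model. In production this comes from shap.TreeExplainer; the sketch below computes exact Shapley values by brute force for a hypothetical two-feature model, purely to demonstrate that the attributions plus the base rate recover the prediction:

```python
# Library-free illustration of SHAP's additivity property on a tiny model.
from itertools import combinations
from math import factorial

def model(x, background, present):
    """Model output with only `present` features taking x's values."""
    vals = [x[i] if i in present else background[i] for i in range(2)]
    a, b = vals
    return 2 * a + 3 * b + a * b          # toy model with an interaction term

def shapley(x, background, n=2):
    phis = []
    for i in range(n):
        phi, others = 0.0, [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += w * (model(x, background, set(S) | {i})
                            - model(x, background, set(S)))
        phis.append(phi)
    return phis

x, bg = [1.0, 2.0], [0.0, 0.0]
phis = shapley(x, bg)
base = model(x, bg, set())                # base rate: all features at background
# Additivity: base + sum of attributions equals the actual prediction.
assert abs(base + sum(phis) - model(x, bg, {0, 1})) < 1e-9
```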
In production, SHAP is used for two purposes. First, it generates the human-readable explanation shown on the <a href='/predictions'>predictions page</a> — for example, identifying the top 3 features pushing toward each team and translating them into natural language. Second, it provides a diagnostic tool for model debugging: when a prediction seems wrong, SHAP values reveal whether the model is relying on a stale feature (like an outdated head-to-head record) or correctly identifying a draft imbalance.
For DraftNet, feature attribution is computed using integrated gradients rather than SHAP, since TreeExplainer does not apply to neural networks. The integrated gradients are computed with respect to the 91 expert features, providing a complementary view of what the neural network considers important. In practice, the two models often agree on the direction of individual feature contributions but disagree on magnitude — XGBoost tends to weight historical team performance more heavily, while DraftNet emphasizes compositional synergy.
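Integrated gradients is short enough to sketch directly. The scorer below is a stand-in for DraftNet restricted to the 91 expert features; on a linear model, IG provably recovers weight-times-input, which makes the sketch self-checking:

```python
# Sketch of integrated gradients: attribute f(x) - f(baseline) to features
# by averaging gradients along the straight path from baseline to x.
import torch

def integrated_gradients(f, x, baseline, steps=64):
    alphas = torch.linspace(0, 1, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)      # (steps, n_features)
    path.requires_grad_(True)
    f(path).sum().backward()
    avg_grad = path.grad.mean(dim=0)               # Riemann-sum approximation
    return (x - baseline) * avg_grad

# Toy "network": a linear scorer over 91 expert features.
w = torch.randn(91)
f = lambda z: z @ w
x, base = torch.rand(91), torch.zeros(91)
attr = integrated_gradients(f, x, base)
assert torch.allclose(attr, w * x, atol=1e-4)      # exact for linear models
```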
Value Bet Detection and Risk Gates
The value bet detection module compares the calibrated, dampened prediction against implied probabilities derived from bookmaker odds. A value bet is flagged when the AI's edge (predicted probability minus implied probability) exceeds a threshold, subject to several risk gates.
Gate 1: League tier must be tier-1 or tier-2. Tier-3 and unknown leagues are excluded entirely.
Gate 2: Odds must be below 2.50. Historical analysis showed 0 correct predictions out of 8 at odds of 2.50 or higher, indicating that extreme underdog predictions are unreliable.
Gate 3: Minimum confidence of 62% after all dampening.
Gate 4: Market gap below 15%. If the AI disagrees with the market by more than 15 percentage points, the bet is skipped rather than flagged, because gaps this large typically indicate missing context.
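The four gates reduce to a short predicate; the function signature here is an illustrative assumption:

```python
# Sketch of the four risk gates with the thresholds stated above.
def passes_risk_gates(league_tier, odds, confidence, market_gap):
    if league_tier not in ("tier1", "tier2"):
        return False          # Gate 1: exclude tier-3 and unknown leagues
    if odds >= 2.50:
        return False          # Gate 2: extreme underdog odds are unreliable
    if confidence < 0.62:
        return False          # Gate 3: minimum post-dampening confidence
    if market_gap >= 0.15:
        return False          # Gate 4: skip extreme AI-market disagreement
    return True

assert passes_risk_gates("tier1", 1.85, 0.66, 0.08)
assert not passes_risk_gates("tier3", 1.85, 0.66, 0.08)   # fails Gate 1
```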
Performance tracking splits results by odds regime (favorites vs. underdogs) and league tier, enabling continuous evaluation of where the model adds value. You can see aggregate accuracy and profit statistics on the <a href='/stats'>track record page</a>.
Production Accuracy Gap and Future Work
The current production accuracy is approximately 66.7% on known-league matches, compared to 77% on the offline test set. This gap has several identified contributors. The largest is the replay feature null rate (73% of live matches lack enrichment data at prediction time). League-average priors mitigate this but cannot fully replace match-specific replay analysis. The second contributor is distribution shift — the test set reflects the same patch and meta as the training data, while production predictions span patch transitions and roster changes.
Planned improvements include training a tier-1-only model on cleaner, less noisy data; adding in-game live predictions using gold and XP data from the Hawk scoreboard feed; and expanding to CS2 match prediction. A related item, player hero-pool depth as an additional feature dimension, was completed in March 2026. The training and evaluation infrastructure is designed to support rapid iteration, with automated backtesting and calibration curve generation after each retrain cycle.
If you want to explore the model's behavior interactively, the <a href='https://draft.britbets.xyz'>draft simulator</a> lets you construct any draft and see DraftNet's prediction in real time. For more about the platform, visit the <a href='/'>BritBets homepage</a>.