How Ubunifu Madness Works
A transparent look at the data, models, and methods behind every prediction.
Elo Ratings
Every team starts at 1500. After each game, the winner gains points and the loser drops by the same amount. How many points depends on three things: how likely the win was (upsets move the needle more), the margin of victory, and whether the game was at home or away. We apply a 101.9-point home court advantage and use a K-factor of 21.8 — tuned via Optuna optimization to balance responsiveness with stability.
Between seasons, every team's rating regresses 11% toward 1500. This prevents ratings from inflating over time and accounts for roster turnover. Ratings update daily from ESPN game results using the exact same formula used to process 40 years of historical games.
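The update and regression rules above can be sketched in code. This is an illustrative sketch, not the app's actual implementation: the constants (101.9, 21.8, 11%) come from the text, but the margin-of-victory multiplier shown here is one common choice and the production formula may differ.

```python
# Illustrative Elo sketch — constants from the text; function names and the
# margin-of-victory multiplier are assumptions, not the production code.
HOME_ADVANTAGE = 101.9
K_FACTOR = 21.8
SEASON_REGRESSION = 0.11
BASE_RATING = 1500.0

def expected_win_prob(rating_a: float, rating_b: float, a_is_home: bool) -> float:
    """Standard logistic Elo expectation, with home court credited to the home team."""
    diff = rating_a - rating_b + (HOME_ADVANTAGE if a_is_home else -HOME_ADVANTAGE)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def update_elo(winner: float, loser: float, winner_home: bool, margin: int) -> tuple[float, float]:
    """Winner gains exactly what the loser drops; upsets and big margins move more."""
    exp = expected_win_prob(winner, loser, winner_home)
    # Margin-of-victory multiplier (one common form; the app's exact form may differ).
    mov_mult = ((margin + 3) ** 0.8) / (7.5 + 0.006 * (winner - loser))
    delta = K_FACTOR * mov_mult * (1.0 - exp)
    return winner + delta, loser - delta

def regress_to_mean(rating: float) -> float:
    """Between seasons, pull every rating 11% back toward 1500."""
    return rating + SEASON_REGRESSION * (BASE_RATING - rating)
```

Note the zero-sum property: because both teams move by the same `delta`, the total rating in the pool is conserved by game updates, and only the off-season regression nudges the distribution.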
An average D1 team sits around 1500. Top 25 teams are typically 1800+. The #1 team is usually around 2100. The system has been validated against 4,302 tournament games from 1985 to present.
Strength of Schedule
Strength of schedule (SOS) is the average Elo rating of all opponents a team has faced during the season. A team with a high SOS has been tested against tough competition, while a low SOS suggests a softer schedule. This matters because a 25-5 record against a weak schedule is very different from 25-5 against elite opponents.
SOS is available on the Compare page and through the Madness Agent. It helps contextualize win-loss records and Elo ratings — a team with a high Elo and a high SOS has proven itself against quality opponents.
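As defined above, SOS is just an average, which a one-liner makes concrete (names here are illustrative; opponent ratings are assumed to be looked up already):

```python
# Minimal SOS sketch — the function name and inputs are illustrative.
def strength_of_schedule(opponent_elos: list[float]) -> float:
    """SOS is the mean Elo rating of every opponent faced this season."""
    return sum(opponent_elos) / len(opponent_elos)
```

A team that has faced opponents rated 1820, 1750, and 1540 has an SOS of about 1703 — well above the 1500 D1 average, indicating a tough schedule so far.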
Conference Strength
Conference rankings use four metrics. Average Elo is the mean rating across all teams in the conference — it tells you overall depth. Non-Conference Win Rate counts how a conference performs against outside opponents in regular season games, removing the noise of intra-conference cannibalization. Top 5 Elo measures elite talent at the top. Parity is the inverse of Elo standard deviation — higher parity means teams are more evenly matched with no weak links.
These metrics refresh automatically as new game results come in. Non-conference win rate is the most telling — it directly measures how a conference performs against the rest of D1.
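The four metrics can be sketched with the standard library (field names are illustrative, not the production schema):

```python
# Sketch of the four conference metrics described above.
from statistics import mean, stdev

def conference_metrics(team_elos: list[float],
                       non_conf_wins: int, non_conf_games: int) -> dict:
    """Compute the four conference-strength metrics from team Elo ratings
    and the conference's non-conference record."""
    top5 = sorted(team_elos, reverse=True)[:5]
    return {
        "avg_elo": mean(team_elos),                   # overall depth
        "non_conf_win_rate": non_conf_wins / non_conf_games,
        "top5_elo": mean(top5),                       # elite strength at the top
        "parity": 1.0 / stdev(team_elos),             # inverse std dev: higher = more even
    }
```

Note that parity is undefined for a conference where every team has an identical rating (standard deviation of zero); a real implementation would need to guard that edge case.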
Blended 6-Signal Prediction System
Live predictions combine six independent signals into a single win probability. This blended approach outperforms any single signal alone, because each captures a different dimension of team quality:
- Static Model (30%): LR + LightGBM ensemble trained on 4,302 tournament games (1985–2025) with 31 features. Brier score 0.1413.
- Elo Ratings (30%): Real-time ratings updated daily from ESPN results. Captures current team strength.
- Momentum (15%): Last 10 games win percentage and margin of victory. Catches hot/cold streaks.
- Conference Strength (10%): 70% conference avg Elo + 30% non-conference win rate. Accounts for quality of competition.
- SOS-Adjusted Record (10%): Win percentage adjusted for strength of schedule — a 25-5 record against tough opponents is more impressive than 25-5 against weak ones.
- Efficiency (5%): Offensive vs defensive points per 100 possessions. Measures scoring quality independent of pace.
When the static model isn't available for a team (e.g., mid-majors with limited data), the remaining five live signals are re-weighted automatically. This ensures every D1 team gets a data-driven prediction, not just teams with full Kaggle coverage.
Tossup Handling
When the blended model's confidence is below 52% — meaning neither team is favored above 52% — the game is labeled a TOSSUP. This is the model being honest: a 51% prediction is barely better than a coin flip, and pretending to have a strong pick would be misleading.
Tossup games appear in yellow on the Scores page instead of showing a directional pick. They're excluded from accuracy metrics on the Performance page, so the model's reported accuracy reflects only games where it had genuine conviction. The Performance page shows how many games were tossups separately.
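The rule is simple enough to state in a few lines (the labels and function name are illustrative; the 52% threshold is from the text):

```python
# The tossup rule in code — threshold from the text, labels illustrative.
TOSSUP_THRESHOLD = 0.52

def classify_prediction(home_win_prob: float) -> str:
    """Label a game TOSSUP when neither side clears the 52% threshold."""
    if max(home_win_prob, 1.0 - home_win_prob) < TOSSUP_THRESHOLD:
        return "TOSSUP"
    return "HOME" if home_win_prob >= 0.5 else "AWAY"
```

So a 51%/49% game is a TOSSUP, while 63%/37% yields a directional pick for the favorite.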
Static Model Details
The static model (one of six blended signals) is an ensemble of Logistic Regression and LightGBM. Each is trained on 4,302 men's and women's NCAA tournament games from 1985–2025, using 31 features organized into eight categories:
- Elo: Current rating, rating difference, expected win probability
- Conference: Average Elo, non-conference win rate, tournament historical win rate
- Four Factors: eFG%, turnover rate, offensive rebound rate, free throw rate (and opponent versions)
- Efficiency: Offensive and defensive points per 100 possessions, tempo
- Schedule: Strength of schedule (average opponent Elo)
- Massey Ordinals: Composite ranking from 15 independent ranking systems
- Momentum: Last 10 games win percentage and margin of victory
- Experience: Coach tenure, seed (when available)
The ensemble weights are 76% LR + 24% LGB, with isotonic calibration applied. LR provides stable, well-calibrated probabilities while LGB captures non-linear interactions. Together they achieve a Brier score of 0.1413, which is 43.5% better than always picking the higher seed.
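The inference step of that ensemble can be sketched as follows. This is a sketch of the prediction path only, not the training pipeline; it assumes the two models expose `predict_proba`-style probabilities and that the isotonic calibrator was fit on held-out games. It requires scikit-learn and NumPy.

```python
# Sketch of the ensemble inference step — weights from the text; the
# calibrator-fitting step shown at the bottom is an assumption about workflow.
import numpy as np
from sklearn.isotonic import IsotonicRegression

LR_WEIGHT, LGB_WEIGHT = 0.76, 0.24

def ensemble_probs(lr_probs: np.ndarray, lgb_probs: np.ndarray,
                   calibrator: IsotonicRegression) -> np.ndarray:
    """Blend the two models' win probabilities 76/24, then map the blend
    through a fitted isotonic calibration curve."""
    blended = LR_WEIGHT * lr_probs + LGB_WEIGHT * lgb_probs
    return calibrator.predict(blended)

# Fitting the calibrator on held-out predictions and outcomes, e.g.:
# calibrator = IsotonicRegression(out_of_bounds="clip").fit(blended_val, y_val)
```

Isotonic regression learns a monotone mapping from raw blended probabilities to empirically calibrated ones, so a reported "70%" really wins about 70% of the time without reordering which team is favored.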
Live Data Pipeline
Historical data comes from Kaggle's March Machine Learning Mania dataset, covering every D1 game from 1985 to present. This seeds the database with initial Elo ratings, conference strength, and team stats.
Once the database is populated, the app runs independently of the CSV files. A daily cron job fetches completed game scores from ESPN, computes Elo updates using our own formula, updates team records, refreshes conference strength metrics, recomputes player stats, and recalculates strength of schedule for every team. ESPN provides the raw scores — every analytical metric is computed by us.
Live scores on the Scores page come directly from ESPN's scoreboard API, enriched with our Elo ratings and model win probabilities. Crucially, every prediction is locked before tipoff — once a game starts, the pre-game prediction is frozen and never updated retroactively. This ensures honest performance tracking. After games finish, the Scores page shows whether the locked prediction was correct, along with a daily accuracy summary.
The Performance page aggregates all locked predictions into cumulative accuracy charts, daily breakdowns, calibration curves, and a game-by-game log. This gives you full transparency into how well the model actually performs over time — no cherry-picking, no retroactive changes.
Madness Agent
The chat agent uses Claude (Anthropic's AI) with tool access to query our entire database in real time. It doesn't just get a static text dump — it actively looks up data to answer your questions. The agent has six tools:
- Team Lookup: Search any D1 team by name — get Elo, record, conference, stats, momentum, coach info, and strength of schedule
- Matchup Prediction: Get head-to-head win probabilities with stat comparisons for any two teams
- Conference Analysis: Conference strength metrics, top teams in each conference
- Rankings: Top teams by Elo, filterable by conference
- Live Scores: Today's games and results from ESPN
- Upset Finder: Identify potential upsets where underdogs have meaningful win probability
When you ask "Who should I pick in a Duke vs. UNC matchup?", the agent calls the matchup prediction tool, pulls real win probabilities and stats, then explains its reasoning with specific numbers. It doesn't guess — every claim is grounded in data.
Men's and Women's Coverage
Every feature works for both men's and women's basketball. Rankings, conference strength, predictions, live scores, and the bracket builder all support a gender toggle. The static model is trained on both the men's and women's tournaments. Elo ratings are computed independently per gender using the same methodology.
A note on Elo scales: Men's and women's Elo ratings operate as independent pools. You may notice that top women's teams have higher raw Elo numbers than top men's teams. This reflects the different competitive dynamics in women's basketball (historically more dominant top programs like UConn, South Carolina, Stanford). The raw numbers should only be compared within the same gender — a women's Elo of 2300 and a men's Elo of 2100 both indicate elite teams at the top of their respective pools. Prediction quality is calibrated independently within each pool.
Known Limitations
- No player-level data — our model operates at the team level
- No injury adjustments — a key player being out is not reflected in predictions
- Four Factors stats are computed from Kaggle box score data (updated seasonally, not after each game)
- Women's Massey ordinals are not available from Kaggle, reducing feature coverage for women's predictions
- Early season ratings carry more uncertainty — Elo stabilizes after ~15 games
- Men's and women's Elo scales differ in magnitude — compare within gender only
Built by Richard Pallangyo for the Kaggle March ML Mania 2026 competition. Questions about methodology? The Madness Agent can explain. See also: Terms & Disclaimers.