BARSElo is a data-driven skill rating system for recreational dodgeball players in the Big Apple Recreational Sports (BARS) league in NYC. The project emerged from a simple question: can you rank individual player skill when win/loss records are heavily determined by randomly assigned teams? Rather than relying on traditional win-loss records, I built a machine learning model that uses game outcomes and margin of victory to estimate underlying player skill while controlling for team composition.
The system uses historical game data (790+ games spanning Fall 2023 to present) to train Bradley-Terry variants with margin-of-victory extensions. The approach separates individual skill from team effects through batch optimization, uses regularization designed for sparse-data robustness, and validates predictions using both chronological and "new-team" evaluation modes. The result is an interactive static site with player trajectories, searchable databases, and model comparisons.
This project demonstrates the full machine learning pipeline: exploratory data analysis, model development with competing hypotheses, hyperparameter optimization, cross-validation, and deployment. It also illustrates the tension between predictive accuracy and interpretability — a model can be statistically sound while producing rankings that don't pass the "eye test" for practical use.
The rankings page includes player skill trajectories, searchable databases, team comparisons, and visualizations of different model predictions.
Full source code, model implementations, hyperparameter optimization logs, and data processing scripts.
Updated: March 28, 2026
BT-VET occasionally posted a slightly better NLL, but hyperparameter optimization often suppressed the veteran-uncertainty behavior after only a few games. In practice, the model became more volatile than the original L2 formulation and gave small-sample anomalies too much weight.
This reinforced the central tradeoff: the model with the best objective score is not always the best ranking system for this use case. If the goal is short-term prediction, extreme rookie estimates can be acceptable. If the goal is robust rankings that reward sustained evidence, stronger priors are often better. My target is the second case: reduce overfitting to anomalies while still letting genuinely strong new players rise when the data supports it.
BT-Normal is now the preferred development direction because it is more stable and philosophically aligned with the goal: requiring substantial evidence before pushing a player far from league average, without hard-coding arbitrary rookie penalties.
Like many first attempts at skill rating, I started with Elo from chess. Players begin at 1000, team rating is the roster average, and ratings update after each game from expected vs. actual outcomes. I tuned the K-factor to control movement speed.
It worked surprisingly well, around 62% win prediction accuracy. But Elo is reactive: it rewards players on currently winning teams instead of disentangling individual skill from team composition. It also ignores margin of victory, losing the difference between a 5-0 and a 3-2 game.
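The Elo baseline above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the standard chess scale of 400 is assumed, and the K-factor shown is illustrative rather than the tuned value.

```python
# Minimal Elo baseline sketch: team rating is the roster mean, and every
# player on a team receives the same update after a game.

def expected_score(team_a: float, team_b: float, scale: float = 400.0) -> float:
    """Standard Elo win expectancy for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((team_b - team_a) / scale))

def update_ratings(ratings, roster_a, roster_b, a_won, k=32.0):
    """Update per-player ratings in place after one game."""
    team_a = sum(ratings[p] for p in roster_a) / len(roster_a)
    team_b = sum(ratings[p] for p in roster_b) / len(roster_b)
    exp_a = expected_score(team_a, team_b)
    delta = k * ((1.0 if a_won else 0.0) - exp_a)
    for p in roster_a:
        ratings[p] += delta
    for p in roster_b:
        ratings[p] -= delta
    return ratings
```

Because the update depends only on the most recent game, this is exactly the reactive behavior described above: a player's rating tracks their current team's results rather than their individual contribution.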
TrueSkill seemed perfect for team-based games. It models uncertainty and has strong Bayesian foundations, but it performed worse than Elo. The core issue was overfitting: ~800 parameters for ~800 games gave too much flexibility. Learning two parameters per player (μ and σ) was too many degrees of freedom for this dataset.
The detour was still useful. It taught me Bayesian modeling and clarified the main limitation: sequential update rules are inherently reactive, so Elo and TrueSkill both absorb momentum from current team performance.
I kept returning to the same question: how should margin of victory be incorporated correctly? I tried treating score differences as "repeated games," adjusting win probabilities by margin, and modifying uncertainty terms. None felt principled. Then I found margins were approximately Gaussian, which changed the direction.
Instead of sequential updates, I optimized all player skills simultaneously over the full game history. That became the core idea behind Bradley-Terry (BT).
Core approach: Each player has a latent skill parameter θ (theta). Team skill is the mean of player skills. I maximize the likelihood of observing all game outcomes given these parameters, with L2 regularization to prevent overfitting.
Margin breakthrough: I unified the prediction model to use a Gaussian distribution over skill differences:
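Written out, the unified formulation looks roughly like this (my reconstruction from the description, where θ̄ is a team's mean skill and σ is the margin-noise hyperparameter):

```latex
m_{AB} \sim \mathcal{N}\left(\bar\theta_A - \bar\theta_B,\ \sigma^2\right),
\qquad
P(A \text{ wins}) = P(m_{AB} > 0) = \Phi\left(\frac{\bar\theta_A - \bar\theta_B}{\sigma}\right)
```

One distribution serves both purposes: the observed margin is its likelihood, and the win probability falls out as the probability mass above zero.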
The model uses scipy.optimize to estimate maximum-likelihood skill values across ~700 games, with hyperparameters for regularization strength (lambda) and margin noise (sigma).
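A simplified sketch of that batch optimization, assuming a Gaussian likelihood over margins with noise σ and an L2 penalty λ (function and variable names are mine, not the project's):

```python
# Hedged sketch of batch maximum-likelihood skill estimation: all player
# skills are optimized jointly over the full game history.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(theta, games, lam, sigma):
    """games: list of (roster_a_indices, roster_b_indices, margin) tuples."""
    nll = 0.0
    for roster_a, roster_b, margin in games:
        # Team skill is the mean of its players' latent skills.
        diff = theta[roster_a].mean() - theta[roster_b].mean()
        nll -= norm.logpdf(margin, loc=diff, scale=sigma)
    # L2 regularization pulls skills toward the league average (zero).
    return nll + lam * np.sum(theta ** 2)

def fit_skills(games, n_players, lam=0.1, sigma=2.0):
    result = minimize(neg_log_likelihood, np.zeros(n_players),
                      args=(games, lam, sigma), method="L-BFGS-B")
    return result.x
```

Unlike the sequential Elo update, every game influences every estimate simultaneously, which is what lets the optimizer disentangle individual skill from team composition.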
After settling on margin-of-victory handling, I explored soccer literature, specifically the Davidson tie parameter. Soccer models often include a draw parameter because ties occur more frequently than simple models predict. I tested an analogous parameter for dodgeball.
The answer was no, but in a useful way. With the Gaussian margin formulation, the optimizer consistently pushed the Davidson tie parameter toward zero. That suggested MOV = 0 already captured games effectively tied in skill, where outcomes were mostly noise. Unlike soccer, dodgeball does not appear to have a separate structural draw effect. This removed an unnecessary degree of freedom.
BT-MOV beats Elo on predictive metrics, but the rankings do not always pass the eye test. Unexpected players can jump into the top 10. The model predicts well, but does it reflect true skill?
This tension led to dual evaluation modes. Standard chronological prediction lets reactive models look better than they are ("all players on winning teams are good"). New-team mode trains on games before a team's first appearance, then freezes the model and predicts that team's games. It is harder (56.8% vs 62.6% chrono), but it answers the right question: does skill estimation generalize to unseen team compositions?
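The new-team split can be sketched as follows, under my assumptions about the record shape (chronologically sorted dicts with team1/team2 keys; function names are mine):

```python
# Minimal sketch of the "new-team" evaluation split: for each team, train
# only on games that precede its first appearance, then evaluate on that
# team's games with the model frozen.

def first_appearance(games):
    """Map each team name to the index of its first game (games are chronological)."""
    first = {}
    for i, game in enumerate(games):
        for team in (game["team1"], game["team2"]):
            first.setdefault(team, i)
    return first

def new_team_splits(games):
    """Yield (team, train_games, eval_games) for each team with a usable split."""
    for team, start in first_appearance(games).items():
        train = games[:start]
        evals = [g for g in games[start:] if team in (g["team1"], g["team2"])]
        if train and evals:
            yield team, train, evals
```

Because the model never sees the evaluated team's roster together, it can only succeed by having estimated the individual players correctly, which is the question the chronological mode fails to isolate.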
About 240 players (~60%) have played only one season on one team, 15 games or fewer. For those players, I need to separate "this player is good" from "this team was good" with limited information.
The solution was BT-VET (Bradley-Terry with Veteran Uncertainty). It weighted games by player experience, so newer or less-frequent players carried higher uncertainty. Rookie estimates were pulled toward league average, preventing wild swings from lucky or unlucky small samples. It achieved strong new-team performance (NLL 0.664) and matched the intended philosophy: uncertainty should shrink as players accumulate games.
Next iteration: BT-Uncert — This refined uncertainty formulation addressed a ranking issue. In Bayesian systems like TrueSkill, a common ranking rule is μ - k*σ (conservative estimates), but that felt too arbitrary for this project. BT-Uncert instead ranked players by average head-to-head win probability: each player's expected win rate against every other player. This compressed skill and uncertainty into a 1D ranking that directly answers "who would win the most matchups?" The O(n²) calculation required vectorized NumPy broadcasting.
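The O(n²) matchup table vectorizes naturally with broadcasting. This sketch assumes a Gaussian win-probability form over skill differences with combined per-player uncertainties, which may differ from the exact formula BT-Uncert used:

```python
# Sketch of ranking by average head-to-head win probability, vectorized
# with NumPy broadcasting instead of an O(n^2) Python loop.
import numpy as np
from scipy.stats import norm

def head_to_head_ranking(mu, sigma):
    """mu, sigma: per-player skill means and uncertainties (1D arrays).
    Returns each player's mean win probability against every other player."""
    diff = mu[:, None] - mu[None, :]                    # (n, n) skill gaps
    scale = np.sqrt(sigma[:, None] ** 2 + sigma[None, :] ** 2)
    probs = norm.cdf(diff / scale)                      # P(player i beats player j)
    np.fill_diagonal(probs, np.nan)                     # exclude self-matchups
    return np.nanmean(probs, axis=1)                    # average over opponents
```

High uncertainty flattens a player's win probabilities toward 0.5 against everyone, so uncertain players drift toward the middle of the ranking, which is the intended compression of skill and uncertainty into one dimension.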
I explored TrueSkill Through Time as a follow-up, hoping a temporal Bayesian structure would handle sparse-data instability more elegantly. I could not get meaningful signal from my data in its current form.
There may still be potential there, but adopting the framework introduced real tradeoffs. Classic TrueSkill had already underperformed, and fitting fully into TTT pulled me away from choices I cared about, especially custom margin-of-victory handling. Since I already had a strong custom optimization framework, this felt more like a framework mismatch than progress.
The latest iteration is BT-Normal. I removed the BT-Uncert uncertainty term because it was not reliably de-ranking under-observed players in a stable way. A fully Bayesian uncertainty treatment may return later, but for now the priority is robust behavior with fewer moving parts.
BT-Normal replaces fixed L2 regularization with a learned Gaussian-scale prior on skill. Instead of penalizing with λΣ(skill²), the loss uses:
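Up to an additive constant, the negative log of a zero-mean normal prior with variance τ² gives (my reconstruction of the formula):

```latex
\sum_{i} \frac{\theta_i^2}{2\tau^2} + N \log \tau
```

where N is the number of players. The log τ term is what makes τ learnable: shrinking τ tightens the penalty on every θᵢ, while growing τ pays an explicit cost, so the optimizer must balance the two against the data.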
This reframes regularization as a normal prior with variance τ² and learns τ directly from the data. Conceptually, this is cleaner than learning λ directly, since an unconstrained λ tends to be optimized toward zero, which removes the penalty entirely and overfits.
In practice, unconstrained optimization pushed τ toward zero and collapsed skills by exploiting the objective. To prevent this, I added a Gamma-style barrier term on τ²:
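The exact parameterization is my assumption, but one standard choice is the negative log-density of a Gamma(a, b) prior on τ²:

```latex
b\,\tau^2 - (a - 1)\log \tau^2
```

For a > 1 this diverges at both extremes: as τ² → 0 the log term blows up, and as τ² → ∞ the linear term dominates, matching the "prevents runaway behavior at 0 or infinity" property described below.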
I am not fully satisfied with choosing a and b as constants for now, but the approach is still principled: it does not force a specific τ value, it only prevents runaway behavior at 0 or infinity. In practice, it does what I wanted: players need more evidence before moving far from the mean, sparse-data players are naturally pulled toward league average, and strong new players can still rise when performance is consistently strong.
The project combines data pipeline, statistical modeling, hyperparameter optimization, and interactive visualization.
Data collection is semi-automated. I download HTML pages from the league's scheduling site (LeagueLobster) manually, then scripts extract game results and team rosters:
- extract_games.py: parses HTML for scored games, deduplicates by (datetime, team1, team2), sorts chronologically
- extract_teams.py: extracts team rosters from standings pages
- resolve_aliases.py: interactive tool to handle player name variations (e.g., "Sam" vs. "Samantha")

The pipeline outputs Sports Elo - Games.csv (~790 games) and Sports Elo - Teams.csv. All rating systems share a common base class (models/base.py) with a uniform API: update(), predict_win_prob(), expose(). Hyperparameters live in unified_config.json for reproducible experiments.

Sparse data is the core technical challenge. About 240 players (~60%) have ≤15 games on a single team. I'm trying to separate "this player is good" from "this team was good" with almost no information. The model needs to downweight sparse-data players while still acknowledging real skill differences.
Teams are fixed per season even if players stop showing up, but I have no attendance data. The model has to treat attendance patterns as part of "skill" because there's no alternative. This is obviously not ideal — a great player who misses half the games shouldn't be rated the same as someone who shows up every week.
Tournament data has been incredibly valuable — it provides cross-team matchups and validates ratings. The league's travel team (composed of top players) dominates tournaments, giving a strong signal that those players are genuinely good. But it raises the central challenge: which players on that dominant team are actually stars, and which are competent players overestimated by association?
TrueSkill taught me to count parameters. 800 parameters for 800 games equals trouble. Now I'm much more careful about model complexity relative to data size.
I can achieve 62-63% prediction accuracy, but the rankings sometimes feel wrong. Unknown players rank surprisingly high. This is the current open problem — finding principled ways to improve ranking quality without just tweaking toward personal biases.