BARSElo is a data-driven skill rating system for recreational dodgeball players in the Big Apple Recreational Sports (BARS) league in NYC. The project emerged from a simple question: can you rank individual player skill when win/loss records are heavily determined by randomly-assigned teams? Rather than relying on traditional win-loss records, I built a machine learning model that uses game outcomes and margin of victory to estimate underlying player skill while controlling for team composition.
The system uses historical game data (790+ games spanning Fall 2023 to present) to train a Bradley-Terry model with margin-of-victory extensions. The approach separates individual skill from team effects through batch optimization, leverages Bayesian uncertainty for sparse-data players, and validates predictions using both chronological and "new-team" evaluation modes. The result is an interactive static site with player trajectories, searchable databases, and model comparisons.
This project demonstrates the full machine learning pipeline: exploratory data analysis, model development with competing hypotheses, hyperparameter optimization, cross-validation, and deployment. It also illustrates the tension between predictive accuracy and interpretability — a model can be statistically sound while producing rankings that don't pass the "eye test" for practical use.
The rankings page includes player skill trajectories, searchable databases, team comparisons, and visualizations of different model predictions.
Full source code, model implementations, hyperparameter optimization logs, and data processing scripts.
These metrics show meaningful improvement over Elo (~62% chrono accuracy but worse new-team generalization) and all TrueSkill variants. BT-Uncert is currently deployed for its principled ranking approach via head-to-head win probabilities, though BT-VET achieves slightly better predictive metrics.
Like many people building a skill rating system for the first time, I started with Elo ratings from chess. The approach is straightforward: players start at 1000, a team's rating is the average of its roster, and ratings update after each game based on expected vs. actual outcome. I tuned the K-factor to balance how quickly players could climb or fall.
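A minimal sketch of that update rule (function and variable names are illustrative, and the K-factor default is a placeholder rather than the tuned value):

```python
def update_elo(ratings, team_a, team_b, team_a_won, k=32):
    """ratings: dict mapping player name -> rating; team_a/team_b: rosters (lists of names)."""
    # A team's rating is the average of its roster's individual ratings.
    r_a = sum(ratings[p] for p in team_a) / len(team_a)
    r_b = sum(ratings[p] for p in team_b) / len(team_b)
    # Standard Elo expected score for team A.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    # Every player on a roster receives the same adjustment, scaled by the K-factor.
    delta = k * ((1.0 if team_a_won else 0.0) - expected_a)
    for p in team_a:
        ratings[p] += delta
    for p in team_b:
        ratings[p] -= delta
```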
Surprisingly, it worked well — around 62% win prediction accuracy. But Elo is fundamentally reactive: it rewards players on teams currently winning, rather than truly disentangling individual skill from team composition. It also ignores margin of victory, discarding information about whether a team won 5-0 or 3-2.
TrueSkill seemed perfect for team-based games. It was literally designed for this problem, models player uncertainty, and has nice Bayesian properties. But it performed worse than basic Elo. The problem: overfitting. With ~800 parameters for ~800 games of data, the model had too much flexibility. TrueSkill maintains two learned parameters per player (mean skill μ and uncertainty σ), which is too many degrees of freedom for my dataset.
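For reference, a single team-game update with the open-source trueskill package looks roughly like this (a minimal sketch with default settings and made-up rosters, not necessarily the exact configuration used here); the two learned numbers per player are visible directly:

```python
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # dodgeball games always have a winner
alice, bob = env.create_rating(), env.create_rating()
carol, dave = env.create_rating(), env.create_rating()

# Team [alice, bob] beats team [carol, dave]; lower rank = better finish.
(alice, bob), (carol, dave) = env.rate([[alice, bob], [carol, dave]], ranks=[0, 1])

print(alice.mu, alice.sigma)  # mean skill and uncertainty: ~800 parameters across ~400 players
```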
The detour wasn't wasted — it taught me about Bayesian approaches and made me realize that sequential update rules are fundamentally limited. Elo and TrueSkill both cheat by being reactive.
I kept asking: how do you properly incorporate margin of victory? I tried treating score differences as "repeated games," adjusting win probabilities based on margin, modifying confidence/uncertainty — nothing felt right. Then I analyzed the empirical distribution: margins are Gaussian. That changed everything.
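A quick way to sanity-check that observation against the game log (the score column names here are assumptions about the CSV schema, not the project's actual headers):

```python
import pandas as pd
from scipy import stats

games = pd.read_csv("Sports Elo - Games.csv")
margins = games["score1"] - games["score2"]  # signed margin; column names assumed

print(margins.describe())
# D'Agostino-Pearson normality test: a large p-value is consistent with a Gaussian margin model.
print(stats.normaltest(margins.dropna()))
```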
Instead of sequential updates, what if I optimized all player skills simultaneously using the entire game history? This became the core idea behind Bradley-Terry (BT).
Core approach: Each player has a latent skill parameter θ (theta). Team skill is the mean of player skills. I maximize the likelihood of observing all game outcomes given these parameters, with L2 regularization to prevent overfitting.
Margin breakthrough: I unified the prediction model to use a Gaussian distribution over skill differences:
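Reconstructing that model from the description (team skill is the roster mean of the player θ values; σ is the margin-noise hyperparameter), the unified form is approximately:

$$
\text{margin}_{AB} \sim \mathcal{N}\!\left(\bar{\theta}_A - \bar{\theta}_B,\ \sigma^2\right),
\qquad
P(A \text{ beats } B) = \Phi\!\left(\frac{\bar{\theta}_A - \bar{\theta}_B}{\sigma}\right)
$$

where the bars denote roster-mean team skill and Φ is the standard normal CDF, so the win probability is simply the chance that the margin comes out positive.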
The model uses scipy.optimize to find maximum likelihood skill values across all ~700 games, with hyperparameters for regularization strength (lambda) and margin noise (sigma).
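A minimal sketch of that batch fit under the Gaussian-margin model above (names and default hyperparameter values are placeholders, not the project's actual code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_bt_mov(games, n_players, lam=1.0, sigma=3.0):
    """games: list of (team_a, team_b, margin), where team_a/team_b are integer
    index arrays for the rosters and margin = score_a - score_b.
    lam is the L2 regularization strength, sigma the margin noise."""
    def neg_log_posterior(theta):
        nll = 0.0
        for team_a, team_b, margin in games:
            diff = theta[team_a].mean() - theta[team_b].mean()
            # Gaussian margin model: margin ~ N(team skill difference, sigma^2)
            nll -= norm.logpdf(margin, loc=diff, scale=sigma)
        # L2 regularization shrinks every skill toward the league mean (0)
        return nll + lam * np.dot(theta, theta)

    result = minimize(neg_log_posterior, x0=np.zeros(n_players), method="L-BFGS-B")
    return result.x  # fitted skill estimates, one per player
```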
BT-MOV beats Elo in predictive metrics, but the rankings don't always pass the eye test. Random players crop up in the top 10. The model is predictive, but is it capturing "true skill"?
This tension led to a key innovation: dual evaluation modes. Traditional chronological prediction lets models cheat by being reactive ("all players on winning teams are good!"). New-team mode trains on all games before a team's first appearance, then predicts that team's games with the model frozen. It's much harder (56.8% accuracy vs 62.6% chrono), but it answers: can the model assess player skill in a way that generalizes to unseen team compositions?
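A sketch of how that split can be constructed (field names are illustrative):

```python
def new_team_splits(games):
    """games: chronologically sorted list of dicts with 'date', 'team1', 'team2' keys.
    Yields one split per team: train on every game played before the team's debut,
    then evaluate on that team's games with the trained model frozen."""
    first_seen = {}
    for g in games:
        for team in (g["team1"], g["team2"]):
            first_seen.setdefault(team, g["date"])

    for team, debut in first_seen.items():
        train = [g for g in games if g["date"] < debut]
        held_out = [g for g in games if team in (g["team1"], g["team2"])]
        yield team, train, held_out
```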
About 240 players (~60%) have played only one season on one team — 15 games or fewer. For these players, I need to separate "this player is good" from "this team was good" with almost no information.
The solution: BT-VET (Bradley-Terry with Veteran Uncertainty). It weights games by player experience, where newer/less-frequent players have higher uncertainty. Rookie players pull estimates toward the league mean, preventing wild swings from lucky/unlucky small samples. This achieved strong new-team performance (NLL of 0.664) and aligns philosophically with what I want: uncertainty that decreases as players accumulate games.
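One way to sketch that idea (the weight function and its constant are illustrative, not the exact BT-VET formulation):

```python
import numpy as np
from scipy.stats import norm

def vet_weighted_nll(theta, games, games_played, lam=1.0, sigma=3.0, k=5.0):
    """games_played[i] = prior game count for player i (a NumPy array).
    Games involving inexperienced rosters get smaller likelihood weights,
    so the L2 prior (league mean 0) dominates and rookies shrink toward it."""
    nll = 0.0
    for team_a, team_b, margin in games:
        experience = games_played[np.concatenate([team_a, team_b])].mean()
        weight = experience / (experience + k)  # rookies -> low weight, veterans -> ~1
        diff = theta[team_a].mean() - theta[team_b].mean()
        nll -= weight * norm.logpdf(margin, loc=diff, scale=sigma)
    return nll + lam * np.dot(theta, theta)
```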
Current evolution: BT-Uncert — a refined uncertainty formulation that addresses a key ranking challenge. In Bayesian settings like TrueSkill, the standard approach is to rank by μ - k*σ (a conservative skill estimate), but the choice of k feels arbitrary and gives no principled way to collapse the two-dimensional (μ, σ) representation into a single ordering. Instead, BT-Uncert ranks by average head-to-head win probability: for each player, compute their expected win rate against every other player. This compresses skill and uncertainty into a 1D ranking in a principled way that directly answers "who would win the most matchups?" The O(n²) cost of computing every pairwise probability is kept manageable by vectorizing the calculations with NumPy broadcasting.
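A minimal sketch of that vectorized ranking step (it assumes each player carries a skill mean and an uncertainty; the exact variance term in the matchup probability is an assumption):

```python
import numpy as np
from scipy.stats import norm

def head_to_head_scores(mu, sigma, margin_noise=3.0):
    """mu, sigma: length-n arrays of per-player skill means and uncertainties.
    Returns each player's average win probability against every other player."""
    diff = mu[:, None] - mu[None, :]                                   # (n, n) skill gaps
    spread = np.sqrt(sigma[:, None]**2 + sigma[None, :]**2 + margin_noise**2)
    win_prob = norm.cdf(diff / spread)                                 # P(row player beats column player)
    np.fill_diagonal(win_prob, np.nan)                                 # ignore self-matchups
    return np.nanmean(win_prob, axis=1)                                # sort descending to rank
```

Broadcasting keeps the O(n²) table cheap: for roughly 400 players it is only a 400×400 array, which NumPy evaluates in milliseconds.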
The project combines data pipeline, statistical modeling, hyperparameter optimization, and interactive visualization.
Data collection is semi-automated. I download HTML pages from the league's scheduling site (LeagueLobster) manually, then scripts extract game results and team rosters:
- extract_games.py: Parses HTML for scored games, deduplicates by (datetime, team1, team2), and sorts chronologically
- extract_teams.py: Extracts team rosters from standings pages
- resolve_aliases.py: Interactive tool to handle player name variations (e.g., "Sam" vs "Samantha")

These scripts produce Sports Elo - Games.csv (~790 games) and Sports Elo - Teams.csv. On the modeling side, every rating system implements a shared interface in models/base.py with a uniform API: update(), predict_win_prob(), and expose(). Experiment configuration is captured in unified_config.json for reproducible experiments.

Sparse data is the core technical challenge. About 240 players (~60%) have ≤15 games on a single team. I'm trying to separate "this player is good" from "this team was good" with almost no information. The model needs to downweight sparse-data players while still acknowledging real skill differences.
Teams are fixed per season even if players stop showing up, but I have no attendance data. The model has to treat attendance patterns as part of "skill" because there's no alternative. This is obviously not ideal — a great player who misses half the games shouldn't be rated the same as someone who shows up every week.
Tournament data has been incredibly valuable — it provides cross-team matchups and validates ratings. The league's travel team (composed of top players) dominates tournaments, giving a strong signal that those players are genuinely good. But it raises the central challenge: which players on that dominant team are actually stars, and which are competent players overestimated by association?
TrueSkill taught me to count parameters. 800 parameters for 800 games equals trouble. Now I'm much more careful about model complexity relative to data size.
I can achieve 62-63% prediction accuracy, but the rankings sometimes feel wrong. Unknown players rank surprisingly high. This is the current open problem — finding principled ways to improve ranking quality without just tweaking toward personal biases.