BARSElo: Player Rating System for Recreational Dodgeball

Overview

BARSElo is a data-driven skill rating system for recreational dodgeball players in the Big Apple Recreational Sports (BARS) league in NYC. The project emerged from a simple question: can you rank individual player skill when win/loss records are heavily determined by randomly assigned teams? Rather than relying on traditional win-loss records, I built a machine learning model that uses game outcomes and margin of victory to estimate underlying player skill while controlling for team composition.

The system uses historical game data (790+ games spanning Fall 2023 to present) to train a Bradley-Terry model with margin-of-victory extensions. The approach separates individual skill from team effects through batch optimization, leverages Bayesian uncertainty for sparse-data players, and validates predictions using both chronological and "new-team" evaluation modes. The result is an interactive static site with player trajectories, searchable databases, and model comparisons.

This project demonstrates the full machine learning pipeline: exploratory data analysis, model development with competing hypotheses, hyperparameter optimization, cross-validation, and deployment. It also illustrates the tension between predictive accuracy and interpretability — a model can be statistically sound while producing rankings that don't pass the "eye test" for practical use.

View the Interactive Rankings

Open BARSElo Rankings

The rankings page includes player skill trajectories, searchable databases, team comparisons, and visualizations of different model predictions.

Project Stats

  • 790+ games spanning Fall 2023 – present
  • 139 unique teams
  • 400+ unique players
  • ~240 players with ≤15 games on a single team (sparse data challenge)
  • Tournament data added recently (cross-team matchups, travel team performance)
  • Manual data collection and curation

Repository

View on GitHub

Full source code, model implementations, hyperparameter optimization logs, and data processing scripts.

Tech Stack

  • Python 3.x — primary programming language
  • scipy.optimize (L-BFGS-B) — batch optimization for maximum likelihood estimation
  • Optuna — Bayesian optimization for hyperparameter search (100-200 trials per model)
  • NumPy — numerical computation, vectorized probability calculations
  • Pandas — data processing and tabular data management
  • BeautifulSoup — HTML scraping for game and roster data
  • Plotly.js — interactive charts and player skill trajectories
  • Vanilla JavaScript — client-side interactivity
  • Git — version control (~50 commits over 4 months)
  • GitHub Pages — static site hosting

Current Performance

All metrics below are evaluated in new-team mode; "cross-mode accuracy" reports the same model's accuracy under chronological evaluation.

BT-Uncert (currently deployed)
  • Mean negative log-likelihood: 0.666
  • Win prediction accuracy: 54.4%
  • Mean Brier score: 0.238
  • Cross-mode accuracy: 62.2%
BT-VET
  • Mean negative log-likelihood: 0.664 (slightly better NLL)
  • Win prediction accuracy: 56.8%
  • Mean Brier score: 0.237
  • Cross-mode accuracy: 62.6%
BT-MOV (Baseline)
  • Mean negative log-likelihood: 0.671
  • Win prediction accuracy: 54.9%
  • Mean Brier score: 0.240
  • Cross-mode accuracy: 62.1%

These metrics show meaningful improvement over Elo (~62% chrono accuracy but worse new-team generalization) and all TrueSkill variants. BT-Uncert is currently deployed for its principled ranking approach via head-to-head win probabilities, though BT-VET achieves slightly better predictive metrics.

The Journey: From Simple Elo to Bradley-Terry with Margin of Victory

Phase 1: Starting Simple with Elo

Like most people building a first skill rating system, I started with chess-style Elo ratings. The approach is straightforward: players start at 1000, a team's rating is the average of its roster, and ratings update after each game based on expected vs. actual outcome. I tuned the K-factor to balance how quickly players could climb or fall.
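
A minimal sketch of that update rule (the starting rating of 1000 comes from the description above; the K-factor, scale, and player names are illustrative):

```python
def elo_update(ratings, team_a, team_b, a_won, k=32.0, scale=400.0):
    """ratings: {player: rating}; team_a / team_b: lists of player names."""
    r_a = sum(ratings[p] for p in team_a) / len(team_a)   # team rating = roster average
    r_b = sum(ratings[p] for p in team_b) / len(team_b)
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)    # expected vs. actual outcome
    for p in team_a:
        ratings[p] += delta
    for p in team_b:
        ratings[p] -= delta
    return ratings

# Two three-player rosters, everyone starting at 1000.
ratings = {name: 1000.0 for name in ["A1", "A2", "A3", "B1", "B2", "B3"]}
elo_update(ratings, ["A1", "A2", "A3"], ["B1", "B2", "B3"], a_won=True)
```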

Surprisingly, it worked well — around 62% win prediction accuracy. But Elo is fundamentally reactive: it rewards players on teams currently winning, rather than truly disentangling individual skill from team composition. It also ignores margin of victory, discarding information about whether a team won 5-0 or 3-2.

Phase 2: The TrueSkill Experiment

TrueSkill seemed perfect: it was designed for exactly this kind of team-based game, models per-player uncertainty, and has nice Bayesian properties. But it performed worse than basic Elo. The problem was overfitting: TrueSkill maintains two learned parameters per player (mean skill μ and uncertainty σ), so with 400+ players that's roughly 800 parameters for ~800 games of data, far too much flexibility for my dataset.

The detour wasn't wasted — it taught me about Bayesian approaches and made me realize that sequential update rules are fundamentally limited. Elo and TrueSkill both cheat by being reactive.

Phase 3: Margin of Victory Struggles

I kept asking: how do you properly incorporate margin of victory? I tried treating score differences as "repeated games," adjusting win probabilities based on margin, and modifying confidence/uncertainty; nothing felt right. Then I analyzed the empirical distribution and found that score margins are approximately Gaussian. That changed everything.

Phase 4: The Bradley-Terry Pivot

Instead of sequential updates, what if I optimized all player skills simultaneously using the entire game history? This became the core idea behind Bradley-Terry (BT).

Core approach: Each player has a latent skill parameter θ (theta). Team skill is the mean of player skills. I maximize the likelihood of observing all game outcomes given these parameters, with L2 regularization to prevent overfitting.

Margin breakthrough: I unified the prediction model to use a Gaussian distribution over skill differences:

  • When only win/loss is known: use the CDF (cumulative distribution function) to get P(A wins)
  • When margin is known: use the PDF (probability density function) centered at the skill difference
  • This naturally leverages more information when available without arbitrary hacks

The model uses scipy.optimize to find maximum likelihood skill values across all ~700 games, with hyperparameters for regularization strength (lambda) and margin noise (sigma).
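
In code, the objective looks roughly like this (a condensed sketch: the game encoding, toy data, and default hyperparameter values are mine, not the project's):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Each game: (indices of team A's players, indices of team B's players,
#             margin if known else None, whether A won).
def neg_log_likelihood(theta, games, sigma=2.0, l2_lambda=0.1):
    nll = 0.0
    for a_idx, b_idx, margin, a_won in games:
        diff = theta[a_idx].mean() - theta[b_idx].mean()       # team skill = mean of player skills
        if margin is not None:
            nll -= norm.logpdf(margin, loc=diff, scale=sigma)  # margin known: Gaussian PDF
        else:
            p_a = norm.cdf(diff / sigma)                       # win/loss only: Gaussian CDF
            nll -= np.log(p_a if a_won else 1.0 - p_a)
    return nll + l2_lambda * np.sum(theta ** 2)                # L2 pulls skills toward the league mean

# Toy example: four players, two games.
games = [(np.array([0, 1]), np.array([2, 3]), 3.0, True),
         (np.array([0, 2]), np.array([1, 3]), None, False)]
result = minimize(neg_log_likelihood, x0=np.zeros(4), args=(games,), method="L-BFGS-B")
print(result.x)   # maximum likelihood skill estimates (theta)
```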

Phase 5: The Eye Test Problem

BT-MOV beats Elo in predictive metrics, but the rankings don't always pass the eye test. Random players crop up in the top 10. The model is predictive, but is it capturing "true skill"?

This tension led to a key innovation: dual evaluation modes. Traditional chronological prediction lets models cheat by being reactive ("all players on winning teams are good!"). New-team mode trains on all games before a team's first appearance, then predicts that team's games with the model frozen. It's much harder (56.8% accuracy vs 62.6% chrono), but it answers: can the model assess player skill in a way that generalizes to unseen team compositions?
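
A minimal sketch of that split, using toy data (column names and the exact cut points are my assumptions):

```python
import pandas as pd

# Toy game log; the real data has ~790 rows with scores and rosters.
games = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-09-01", "2023-09-08", "2024-01-10"]),
    "team1": ["Sharks", "Sharks", "Rookies"],
    "team2": ["Jets", "Wolves", "Jets"],
})

def new_team_splits(games):
    # First appearance of every team, looking at both sides of each matchup.
    long = pd.concat([
        games[["datetime", "team1"]].rename(columns={"team1": "team"}),
        games[["datetime", "team2"]].rename(columns={"team2": "team"}),
    ])
    first_seen = long.groupby("team")["datetime"].min()
    for team, debut in first_seen.items():
        train = games[games["datetime"] < debut]                           # everything before the debut
        test = games[(games["team1"] == team) | (games["team2"] == team)]  # that team's games
        yield team, train, test   # fit on train, score test with the model frozen

for team, train, test in new_team_splits(games):
    print(team, len(train), "training games,", len(test), "held-out games")
```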

Phase 6: Bayesian Uncertainty for Sparse Data

About 240 players (~60%) have played only one season on one team — 15 games or fewer. For these players, I need to separate "this player is good" from "this team was good" with almost no information.

The solution: BT-VET (Bradley-Terry with Veteran Uncertainty). It weights games by player experience, where newer/less-frequent players have higher uncertainty. Rookie players pull estimates toward the league mean, preventing wild swings from lucky/unlucky small samples. This achieved strong new-team performance (NLL of 0.664) and aligns philosophically with what I want: uncertainty that decreases as players accumulate games.

Current evolution: BT-Uncert — A refined uncertainty formulation that addresses a key ranking challenge. In Bayesian settings like TrueSkill, the standard approach is to rank by μ - k*σ (conservative skill estimates), but this feels arbitrary and doesn't properly handle two-dimensional data. Instead, BT-Uncert ranks by average head-to-head win probability: for each player, compute their expected win rate against every other player. This compresses skill+uncertainty into a 1D ranking in a principled way that directly answers "who would win the most matchups?" The O(n²) computational cost required vectorizing all probability calculations using NumPy broadcasting.
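
A sketch of that ranking computation (the exact uncertainty formula here is my assumption; the broadcasting pattern is what keeps the O(n²) sweep cheap):

```python
import numpy as np
from scipy.stats import norm

def rank_by_head_to_head(theta, games_played, alpha=0.5, sigma=1.0):
    # Assumed uncertainty form: shrinks as a player accumulates games.
    sigma_i = sigma / np.sqrt(1.0 + alpha * games_played)
    diff = theta[:, None] - theta[None, :]                          # all pairwise skill gaps (n x n)
    spread = np.sqrt(sigma_i[:, None] ** 2 + sigma_i[None, :] ** 2)
    win_prob = norm.cdf(diff / spread)                              # P(row player beats column player)
    np.fill_diagonal(win_prob, np.nan)                              # ignore self-matchups
    return np.nanmean(win_prob, axis=1)                             # average head-to-head win probability

theta = np.array([0.8, 0.1, -0.3, 0.5])
games_played = np.array([40, 12, 5, 60])
scores = rank_by_head_to_head(theta, games_played)
ranking = np.argsort(-scores)   # player indices, best first
```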

Technical Approach

The project combines data pipeline, statistical modeling, hyperparameter optimization, and interactive visualization.

Data Pipeline

Data collection is semi-automated. I download HTML pages from the league's scheduling site (LeagueLobster) manually, then scripts extract game results and team rosters:

  • extract_games.py: Parses HTML for scored games, deduplicates by (datetime, team1, team2), and sorts chronologically (see the sketch after this list)
  • extract_teams.py: Extracts team rosters from standings pages
  • resolve_aliases.py: Interactive tool to handle player name variations (e.g., "Sam" vs "Samantha")
  • Output: Sports Elo - Games.csv (~790 games) and Sports Elo - Teams.csv
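
As a concrete example, the dedup and sort step in extract_games.py amounts to something like this (toy rows; the real script builds the frame from parsed HTML):

```python
import pandas as pd

# Toy rows standing in for parsed schedule pages (the second row is a duplicate listing).
games = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-10-05 19:00", "2023-10-05 19:00", "2023-09-28 20:00"]),
    "team1": ["Sharks", "Sharks", "Jets"],
    "team2": ["Jets", "Jets", "Wolves"],
    "score1": [5, 5, 2],
    "score2": [3, 3, 4],
})
games = (games.drop_duplicates(subset=["datetime", "team1", "team2"])  # dedupe by matchup key
              .sort_values("datetime")                                 # chronological order
              .reset_index(drop=True))
games.to_csv("Sports Elo - Games.csv", index=False)
```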

Model Framework

  • Modular base class (models/base.py) with uniform API across all rating systems
  • Each model implements: update(), predict_win_prob(), expose() (see the sketch after this list)
  • Hyperparameter search via Optuna (Bayesian optimization, 100-200 trials per model)
  • Configuration-driven via unified_config.json for reproducible experiments
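
The shared interface looks roughly like this (the three method names come from the project; the signatures and docstrings are my guesses):

```python
from abc import ABC, abstractmethod

class RatingModel(ABC):
    """Uniform API every rating system (Elo, TrueSkill, BT variants) implements."""

    @abstractmethod
    def update(self, team_a, team_b, score_a, score_b):
        """Incorporate one game result (rosters plus final score)."""

    @abstractmethod
    def predict_win_prob(self, team_a, team_b):
        """Return P(team_a beats team_b) under the current ratings."""

    @abstractmethod
    def expose(self):
        """Return a {player: rating} mapping for ranking and export."""
```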

Current Best Model: BT-Uncert

  • Bayesian uncertainty that decreases with games played (alpha parameter)
  • Gaussian margin likelihood (sigma parameter)
  • L2 regularization on skill parameters (l2_lambda)
  • New-team mean NLL: 0.666
  • Cross-mode accuracy: 62.2%
  • Optimizes over ~700 games of data across 139 unique teams
  • Novel ranking approach: instead of the standard Bayesian μ - k*σ (which feels arbitrary), players are ranked by average head-to-head win probability, compressing skill+uncertainty into a principled 1D ranking. Vectorized NumPy calculations keep the O(n²) computation fast.

Key Technical Decisions

  1. Batch optimization over incremental updates: Sequential update rules (Elo, TrueSkill) inherently reward players on teams currently winning. Batch optimization using all data at once better disentangles individual skill from team composition.
  2. Dual evaluation modes: Chronological validation lets models cheat by being reactive. New-team mode forces the model to generalize to unseen team compositions, which is the actual hard problem.
  3. Gaussian margin likelihood: The empirical distribution of margins is Gaussian. Using the PDF when margin is known and CDF when only outcome is known extracts more information without arbitrary weighting.
  4. L2 regularization: With 400+ players and ~800 games, overfitting is real. Regularization pulls estimates toward zero (league average), so sparse-data players don't get extreme ratings.
  5. Static site over Dash: Pre-compute ratings to JSON, serve with GitHub Pages. No backend maintenance, free hosting, instant load times.
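
The deployment step is just that pre-computation; a minimal sketch (file name and payload shape are illustrative):

```python
import json

# Write the fitted ratings to a static JSON file that the GitHub Pages site loads client-side.
ratings = {"Player A": 0.82, "Player B": -0.15}
with open("ratings.json", "w") as f:
    json.dump({"model": "BT-Uncert", "ratings": ratings}, f, indent=2)
```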

Challenges and Learnings

Data Sparsity

This is the core technical challenge. About 240 players (~60%) have ≤15 games on a single team. I'm trying to separate "this player is good" from "this team was good" with almost no information. The model needs to downweight sparse-data players while still acknowledging real skill differences.

Attendance Confounds Everything

Teams are fixed per season even if players stop showing up, but I have no attendance data. The model has to treat attendance patterns as part of "skill" because there's no alternative. This is obviously not ideal — a great player who misses half the games shouldn't be rated the same as someone who shows up every week.

Tournament Data Adds Signal (But New Questions)

Tournament data has been incredibly valuable — it provides cross-team matchups and validates ratings. The league's travel team (composed of top players) dominates tournaments, giving a strong signal that those players are genuinely good. But it raises the central challenge: which players on that dominant team are actually stars, and which are competent players overestimated by association?

The Overfitting Lesson

TrueSkill taught me to count parameters. 800 parameters for 800 games equals trouble. Now I'm much more careful about model complexity relative to data size.

The Eye Test vs. Metrics Tension

I can achieve 62-63% prediction accuracy, but the rankings sometimes feel wrong. Unknown players rank surprisingly high. This is the current open problem — finding principled ways to improve ranking quality without just tweaking toward personal biases.
