Rating Systems Compared: Choosing the Right Algorithm

Over the past few months, we’ve taken a deep dive into the world of rating systems through the lens of Elote, my Python library for implementing and comparing competitive rating algorithms. We’ve explored eight different systems:

  1. Elo - The classic chess rating system
  2. Glicko-1 - Elo with uncertainty tracking
  3. Glicko-2 - Adding volatility to the equation
  4. TrueSkill - Microsoft’s Bayesian team rating system
  5. ECF - The British approach to chess ratings
  6. DWZ - German engineering for chess ratings
  7. Colley Matrix - A linear algebra approach
  8. Ensemble - Combining multiple systems

Now it’s time to bring it all together and compare these systems side by side to help you choose the right one for your specific needs.

Comparing the Systems: A Feature Matrix

Let’s start with a high-level comparison of the key features of each system:

| System | Uncertainty Tracking | Team Support | Volatility | Age Awareness | Computational Complexity | Mathematical Foundation |
|---|---|---|---|---|---|---|
| Elo | No | No | No | No | Low | Logistic probability |
| Glicko-1 | Yes | No | No | No | Medium | Logistic probability |
| Glicko-2 | Yes | No | Yes | No | High | Logistic probability |
| TrueSkill | Yes | Yes | No | No | Very High | Bayesian inference |
| ECF | No | No | No | No | Low | Performance-based |
| DWZ | No | No | No | Yes | Medium | Modified Elo |
| Colley Matrix | No | No | No | No | Medium | Linear algebra |
| Ensemble | Depends | Depends | Depends | Depends | High | Multiple |

This table gives a quick overview, but let’s dig deeper into the strengths and weaknesses of each system.

Elo: The Reliable Classic

Strengths:

  • Simplicity and intuitive understanding
  • Minimal computational requirements
  • Widely recognized and accepted
  • Proven track record over decades
  • Easy to implement and maintain

Weaknesses:

  • No uncertainty tracking
  • Slow to converge for new players
  • Not designed for team competitions
  • Can suffer from rating inflation/deflation
  • Assumes stable skill levels

Best for: Simple one-on-one competitions where computational resources are limited and transparency is important.
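
Elo's core update fits in a few lines. Here is a minimal, self-contained sketch of the standard textbook formulas (not Elote's implementation):

```python
# A minimal sketch of the classic Elo update (textbook formulas).

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the logistic model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Updated ratings after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Equal ratings: the winner gains exactly k/2 points.
print(elo_update(1500, 1500, 1))  # (1516.0, 1484.0)
```

Note that the update is zero-sum: whatever the winner gains, the loser gives up, which is exactly why the system can drift into inflation or deflation as players enter and leave the pool.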

Glicko-1: Elo with Confidence Intervals

Strengths:

  • Tracks rating reliability
  • Handles player inactivity gracefully
  • Faster convergence than Elo
  • Still relatively simple to understand
  • Provides confidence intervals

Weaknesses:

  • More complex than Elo
  • Not designed for team competitions
  • Doesn’t track performance volatility
  • Requires more computational resources
  • Less widely recognized than Elo

Best for: Individual competitions where players may have varying activity levels and you want to express confidence in ratings.
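
Glicko's rating deviation (RD) translates directly into a confidence interval: roughly 95% of the time, the player's true strength lies within about two RDs of the rating. A small illustrative sketch (the values are made up, not Elote output):

```python
# Sketch: turning a Glicko rating and rating deviation (RD)
# into an approximate 95% confidence interval.

def glicko_interval(rating, rd, z=1.96):
    return (rating - z * rd, rating + z * rd)

lo, hi = glicko_interval(1650, 80)
print(f"Rating is between {lo:.0f} and {hi:.0f} with ~95% confidence")
```

A new player with RD 350 gets a very wide interval; an active player with RD 50 gets a tight one, which is exactly the "reliability" the table refers to.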

Glicko-2: The Complete Package

Strengths:

  • Tracks both reliability and volatility
  • Handles player inactivity gracefully
  • Fast convergence to accurate ratings
  • Provides comprehensive player profiles
  • Theoretically sound

Weaknesses:

  • Significantly more complex than Elo or Glicko-1
  • Higher computational requirements
  • Parameter tuning can be challenging
  • Not designed for team competitions
  • Less intuitive for non-technical users

Best for: Serious individual competitions where rating accuracy is critical and you have the computational resources to handle the complexity.
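
One source of Glicko-2's complexity is that it operates on an internal scale: ratings and RDs are divided by the constant 173.7178 from Glickman's specification before the volatility update, then converted back. A quick sketch of just that conversion:

```python
# Sketch of Glicko-2's internal scale conversion (per Glickman's paper):
# mu = (r - 1500) / 173.7178, phi = RD / 173.7178.

GLICKO2_SCALE = 173.7178

def to_glicko2(rating, rd):
    return (rating - 1500) / GLICKO2_SCALE, rd / GLICKO2_SCALE

def from_glicko2(mu, phi):
    return mu * GLICKO2_SCALE + 1500, phi * GLICKO2_SCALE

mu, phi = to_glicko2(1850, 100)
print(round(mu, 4), round(phi, 4))
```

The volatility parameter itself is found by an iterative root-finding step on this internal scale, which is where most of the implementation effort (and the parameter-tuning pain) lives.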

TrueSkill: The Team Player

Strengths:

  • Designed for team competitions
  • Handles any number of teams or players
  • Fast convergence with minimal data
  • Bayesian foundation is mathematically sound
  • Optimized for matchmaking

Weaknesses:

  • Highest computational complexity
  • Most difficult to explain to non-technical users
  • Parameter tuning can be challenging
  • Less widely recognized outside gaming
  • “Black box” feel for many users

Best for: Team-based competitions, especially in gaming contexts, where matchmaking quality is critical.
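
One common way to soften TrueSkill's "black box" feel is to report the conservative score mu − 3·sigma, a value the player's true skill is very likely to exceed. A tiny sketch (the mu/sigma values are illustrative, not Elote output):

```python
# Sketch of TrueSkill's conservative ranking score: mu - 3*sigma.
# TrueSkill's defaults are roughly mu=25, sigma=25/3.

def conservative_rating(mu, sigma):
    return mu - 3 * sigma

# A new player: high mean, but high uncertainty drags the score down.
print(conservative_rating(25.0, 8.0))   # 1.0
# After some games the uncertainty shrinks and the score rises.
print(conservative_rating(27.5, 2.0))   # 21.5
```

This is why a new player in TrueSkill-based leaderboards starts near the bottom despite an average mean: the system is penalizing uncertainty, not skill.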

ECF: The British Pragmatist

Strengths:

  • Simplicity and historical precedent
  • Directly tied to performance
  • Responsive to recent results
  • Calculable by hand (historically)
  • Intuitive for players

Weaknesses:

  • Limited statistical foundation
  • Primarily designed for chess
  • No uncertainty tracking
  • Can suffer from rating drift
  • Less predictive than some alternatives

Best for: Chess competitions, particularly in British contexts, where tradition and simplicity are valued.
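
The historical ECF grading calculation really was simple enough to do by hand: each game scores the opponent's grade plus 50 for a win, minus 50 for a loss, unchanged for a draw, and the new grade is the average. A sketch (omitting the 40-point clamp the ECF applied to opponent grades):

```python
# Sketch of the historical ECF grading calculation: per-game performance
# is opponent's grade +50 (win), -50 (loss), or unchanged (draw);
# the new grade is the average. The 40-point clamp is omitted here.

def ecf_grade(results):
    """results: list of (opponent_grade, score) with score 1, 0.5, or 0."""
    perfs = [opp + (50 if s == 1 else -50 if s == 0 else 0)
             for opp, s in results]
    return sum(perfs) / len(perfs)

games = [(120, 1), (140, 0), (130, 0.5)]
print(ecf_grade(games))  # (170 + 90 + 130) / 3 = 130.0
```

Because the grade is a plain average of recent performances with no prior, it reacts quickly to form but has little statistical machinery behind it, which is both its charm and its weakness.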

DWZ: The Age-Aware System

Strengths:

  • Accounts for player age and development
  • Adapts to different development rates
  • Resistant to rating inflation
  • Considers game count in updates
  • Good for junior competitions

Weaknesses:

  • Requires player age data
  • Primarily designed for chess
  • More complex than Elo
  • Less widely recognized internationally
  • Parameter sensitivity

Best for: Chess competitions with players of varying ages, especially those with strong junior programs.
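
The age awareness comes from DWZ's development coefficient E, which grows with age and rating so that juniors' ratings move faster. A simplified sketch of the idea (the full DWZ formula has additional factors omitted here, and the age brackets below are approximate):

```python
# Simplified sketch of the DWZ update idea: the rating change scales
# with 800 / (E + n), where n is games played and E is a development
# coefficient that grows with age and rating. Juniors (small E) swing
# more. The real DWZ formula has extra factors not shown here.

def dwz_update(rating, expected, actual, n, age):
    # Age component: younger players get a smaller E, hence bigger swings.
    j = 5 if age <= 20 else 10 if age <= 25 else 15
    e = (rating / 1000) ** 4 + j
    return rating + 800 / (e + n) * (actual - expected)

# Same result, same rating: the 18-year-old's rating moves further.
print(dwz_update(1500, 2.5, 4, 5, age=18))
print(dwz_update(1500, 2.5, 4, 5, age=40))
```

The (rating/1000)^4 term also means strong, established players move slowly regardless of age, which is what gives DWZ its resistance to inflation.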

Colley Matrix: The Mathematician’s Choice

Strengths:

  • Mathematically elegant solution
  • Order-independent (game sequence doesn’t matter)
  • Naturally accounts for strength of schedule
  • Bias-free rankings
  • Unique, optimal solution

Weaknesses:

  • Requires solving a system of equations
  • Only considers wins and losses, not margins
  • Static (doesn’t adapt to improvement/decline)
  • Less intuitive for non-technical users
  • Requires complete recalculation for updates

Best for: Complete seasons or tournaments where all results are available at once and strength of schedule is important.
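
The linear-algebra core is compact: build the Colley matrix C and vector b from win/loss records, then solve C r = b. A minimal NumPy sketch (not Elote's implementation):

```python
# Minimal Colley Matrix solve: C[i][i] = 2 + games played by i,
# C[i][j] = -(games between i and j), b[i] = 1 + (wins_i - losses_i)/2.
# Ratings are the unique solution of C r = b.

import numpy as np

def colley_ratings(n, games):
    """games: list of (winner_index, loser_index)."""
    C = 2 * np.eye(n)
    b = np.ones(n)
    for w, l in games:
        C[w, w] += 1
        C[l, l] += 1
        C[w, l] -= 1
        C[l, w] -= 1
        b[w] += 0.5
        b[l] -= 0.5
    return np.linalg.solve(C, b)

# Three players: 0 beats 1, 1 beats 2 -> ratings order 0 > 1 > 2.
print(colley_ratings(3, [(0, 1), (1, 2)]))
```

Note that player 0 is ranked above player 2 without ever playing them; the win is transmitted through the shared opponent, which is the strength-of-schedule effect in action. The ratings also always average 0.5 by construction.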

Ensemble: The Best of All Worlds

Strengths:

  • Can combine strengths of multiple systems
  • Often more accurate than any single system
  • Flexible and customizable
  • More robust to edge cases
  • Can indicate uncertainty through disagreement

Weaknesses:

  • Highest complexity and parameter count
  • Requires running multiple systems
  • Finding optimal weights can be challenging
  • Less transparent than single systems
  • Potential for overfitting

Best for: Applications where maximum accuracy is critical and you have the resources to implement and tune multiple systems.
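
The basic mechanics of blending are straightforward: normalize each system's ratings onto a common scale, then take a weighted combination. A sketch of the idea (Elote's BlendedCompetitor handles this internally; the names and values here are illustrative):

```python
# Sketch of a simple ensemble: convert each system's ratings to
# z-scores so the scales are comparable, then weight and sum.

from statistics import mean, stdev

def zscores(ratings):
    mu, sd = mean(ratings.values()), stdev(ratings.values())
    return {k: (v - mu) / sd for k, v in ratings.items()}

def blend(systems, weights):
    """systems: {name: {player: rating}}; weights: {name: weight}."""
    normed = {name: zscores(r) for name, r in systems.items()}
    players = next(iter(systems.values())).keys()
    return {p: sum(weights[s] * normed[s][p] for s in systems)
            for p in players}

elo = {"A": 1600, "B": 1500, "C": 1400}
glicko = {"A": 1580, "B": 1520, "C": 1380}
print(blend({"elo": elo, "glicko": glicko}, {"elo": 0.5, "glicko": 0.5}))
```

The normalization step matters: Elo, ECF, and TrueSkill live on wildly different scales, so averaging raw ratings would let one system dominate by accident. Disagreement between the normalized scores is also a usable uncertainty signal.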

Benchmarking the Systems: A Practical Comparison

Let’s put these systems to the test with a simple benchmark. We’ll create a synthetic dataset with known “true” skills, run competitions, and see how well each system recovers the true rankings:

from elote import (
    EloCompetitor, GlickoCompetitor, Glicko2Competitor,
    TrueSkillCompetitor, ECFCompetitor, DWZCompetitor,
    ColleyMatrixCompetitor, BlendedCompetitor
)
import random
from scipy.stats import kendalltau

# Create synthetic competitors with known "true" skills
n_competitors = 50
true_skills = {f"Player_{i}": 1500 + i*10 for i in range(n_competitors)}

# Function to generate matchups with some randomness
def generate_matchups(skills, n_matches=1000):
    matchups = []
    for _ in range(n_matches):
        # Select two random players
        a, b = random.sample(list(skills.keys()), 2)
        
        # Determine winner based on skill difference with some randomness
        skill_diff = skills[a] - skills[b]
        p_a_wins = 1 / (1 + 10 ** (-skill_diff / 400))
        result = random.random() < p_a_wins
        
        if result:
            matchups.append((a, b))
        else:
            matchups.append((b, a))
    
    return matchups

# Generate matchups
matchups = generate_matchups(true_skills, n_matches=5000)

# Function to run a rating system and evaluate its performance
def evaluate_system(competitor_class, matchups, true_skills, **kwargs):
    # Initialize competitors
    competitors = {name: competitor_class(**kwargs) for name in true_skills}
    
    # Process matchups
    for winner, loser in matchups:
        competitors[winner].beat(competitors[loser])

    # Get final ratings
    ratings = {name: comp.rating for name, comp in competitors.items()}
    
    # Kendall Tau rank correlation between true skills and system ratings
    players = list(true_skills.keys())
    tau, _ = kendalltau(
        [true_skills[p] for p in players],
        [ratings[p] for p in players]
    )
    
    return tau, ratings

# Evaluate each system
results = {}

# Elo
tau, ratings = evaluate_system(EloCompetitor, matchups, true_skills, initial_rating=1500)
results["Elo"] = (tau, ratings)

# Glicko-1
tau, ratings = evaluate_system(GlickoCompetitor, matchups, true_skills, initial_rating=1500, initial_rd=350)
results["Glicko-1"] = (tau, ratings)

# Glicko-2
tau, ratings = evaluate_system(Glicko2Competitor, matchups, true_skills)
results["Glicko-2"] = (tau, ratings)

# TrueSkill
tau, ratings = evaluate_system(TrueSkillCompetitor, matchups, true_skills)
results["TrueSkill"] = (tau, ratings)

# ECF
tau, ratings = evaluate_system(ECFCompetitor, matchups, true_skills, initial_rating=100)
results["ECF"] = (tau, ratings)

# DWZ
tau, ratings = evaluate_system(DWZCompetitor, matchups, true_skills, initial_rating=1500)
results["DWZ"] = (tau, ratings)

# Colley Matrix
tau, ratings = evaluate_system(ColleyMatrixCompetitor, matchups, true_skills)
results["Colley Matrix"] = (tau, ratings)

# Display results
print("Ranking Accuracy (Kendall Tau correlation with true skills):")
for system, (tau, _) in sorted(results.items(), key=lambda x: x[1][0], reverse=True):
    print(f"{system}: {tau:.4f}")

This benchmark gives us a quantitative measure of how well each system recovers the true rankings from noisy match results.

Choosing the Right System: Decision Factors

When selecting a rating system for your application, consider these key factors:

  1. Competition Type: Individual vs. team competitions
  2. Data Volume: How many competitions/matchups you expect
  3. Computational Resources: What level of complexity you can support
  4. User Understanding: How important it is for users to understand the system
  5. Special Requirements: Age awareness, uncertainty tracking, etc.
  6. Update Frequency: Real-time updates vs. batch processing
  7. Historical Precedent: Industry standards in your domain

Here’s a decision tree to help guide your choice:

Is it a team competition?
├── Yes → Do you need Bayesian uncertainty?
│         ├── Yes → TrueSkill
│         └── No → Modified Elo for teams
└── No → Is uncertainty tracking important?
          ├── Yes → Do you need to track volatility?
          │         ├── Yes → Glicko-2
          │         └── No → Glicko-1
          └── No → Do you have special requirements?
                    ├── Age awareness → DWZ
                    ├── Batch processing → Colley Matrix
                    ├── Historical precedent (chess) → ECF
                    ├── Maximum accuracy → Ensemble
                    └── None of the above → Elo
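
That tree translates almost directly into code. A hypothetical helper (all names and flags are illustrative, not part of Elote):

```python
# The decision tree above, sketched as a function. The flag and
# option names are made up for illustration.

def pick_system(team=False, bayesian=False, uncertainty=False,
                volatility=False, special=None):
    if team:
        return "TrueSkill" if bayesian else "Modified Elo for teams"
    if uncertainty:
        return "Glicko-2" if volatility else "Glicko-1"
    return {"age": "DWZ", "batch": "Colley Matrix",
            "chess_tradition": "ECF", "accuracy": "Ensemble"}.get(special, "Elo")

print(pick_system(team=True, bayesian=True))           # TrueSkill
print(pick_system(uncertainty=True, volatility=True))  # Glicko-2
print(pick_system())                                   # Elo
```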

Conclusion: No One-Size-Fits-All Solution

After exploring eight different rating systems, one thing is clear: there’s no perfect system for all situations. Each approach has its strengths and weaknesses, and the best choice depends on your specific requirements.

For simple applications where transparency and ease of implementation are paramount, Elo remains hard to beat. For more complex scenarios involving teams, uncertainty, or special requirements, the more sophisticated systems offer valuable advantages.

And for those who can’t decide or need maximum accuracy, the Ensemble approach lets you combine the strengths of multiple systems.

The beauty of Elote is that it makes all these options available through a consistent API, allowing you to experiment with different systems and find the one that works best for your specific use case.

I hope this series has given you a deeper understanding of rating systems and how they can be applied to a wide range of competitive scenarios. Whether you’re ranking chess players, sports teams, or tacos in Atlanta, there’s a rating system that’s right for your needs.

Thank you for joining me on this journey through the fascinating world of competitive ratings!