Rating Systems Compared: Choosing the Right Algorithm

Over the past few months, we’ve taken a deep dive into the world of rating systems through the lens of Elote, my Python library for implementing and comparing competitive rating algorithms. We’ve explored eight different systems:

  1. Elo - The classic chess rating system
  2. Glicko-1 - Elo with uncertainty tracking
  3. Glicko-2 - Adding volatility to the equation
  4. TrueSkill - Microsoft’s Bayesian team rating system
  5. ECF - The British approach to chess ratings
  6. DWZ - German engineering for chess ratings
  7. Colley Matrix - A linear algebra approach
  8. Ensemble - Combining multiple systems

Now it’s time to bring it all together and compare these systems side by side to help you choose the right one for your specific needs.

Comparing the Systems: A Feature Matrix

Let’s start with a high-level comparison of the key features of each system:

| System | Uncertainty Tracking | Team Support | Volatility | Age Awareness | Computational Complexity | Mathematical Foundation |
|---|---|---|---|---|---|---|
| Elo | No | No | No | No | Low | Logistic probability |
| Glicko-1 | Yes | No | No | No | Medium | Logistic probability |
| Glicko-2 | Yes | No | Yes | No | High | Logistic probability |
| TrueSkill | Yes | Yes | No | No | Very High | Bayesian inference |
| ECF | No | No | No | No | Low | Performance-based |
| DWZ | No | No | No | Yes | Medium | Modified Elo |
| Colley Matrix | No | No | No | No | Medium | Linear algebra |
| Ensemble | Depends | Depends | Depends | Depends | High | Multiple |

This table gives a quick overview, but let’s dig deeper into the strengths and weaknesses of each system.

Elo: The Reliable Classic

Strengths:

  • Simplicity and intuitive understanding
  • Minimal computational requirements
  • Widely recognized and accepted
  • Proven track record over decades
  • Easy to implement and maintain

Weaknesses:

  • No uncertainty tracking
  • Slow to converge for new players
  • Not designed for team competitions
  • Can suffer from rating inflation/deflation
  • Assumes stable skill levels

Best for: Simple one-on-one competitions where computational resources are limited and transparency is important.
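
Elo's core update fits in a few lines. Here is a minimal, self-contained sketch of the standard textbook formulas (not Elote's implementation):

```python
# A minimal sketch of the classic Elo update (textbook formulas).

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the logistic model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Updated ratings after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Equal ratings: the winner gains exactly k/2 points.
print(elo_update(1500, 1500, 1))  # (1516.0, 1484.0)
```

Note that the update is zero-sum: whatever the winner gains, the loser gives up, which is exactly why the system can drift into inflation or deflation as players enter and leave the pool.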

Glicko-1: Elo with Confidence Intervals

Strengths:

  • Tracks rating reliability
  • Handles player inactivity gracefully
  • Faster convergence than Elo
  • Still relatively simple to understand
  • Provides confidence intervals

Weaknesses:

  • More complex than Elo
  • Not designed for team competitions
  • Doesn’t track performance volatility
  • Requires more computational resources
  • Less widely recognized than Elo

Best for: Individual competitions where players may have varying activity levels and you want to express confidence in ratings.
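
Glicko's rating deviation (RD) translates directly into a confidence interval: roughly 95% of the time, the player's true strength lies within about two RDs of the rating. A small illustrative sketch (the values are made up, not Elote output):

```python
# Sketch: turning a Glicko rating and rating deviation (RD)
# into an approximate 95% confidence interval.

def glicko_interval(rating, rd, z=1.96):
    return (rating - z * rd, rating + z * rd)

lo, hi = glicko_interval(1650, 80)
print(f"Rating is between {lo:.0f} and {hi:.0f} with ~95% confidence")
```

A new player with RD 350 gets a very wide interval; an active player with RD 50 gets a tight one, which is exactly the "reliability" the table refers to.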

Glicko-2: The Complete Package

Strengths:

  • Tracks both reliability and volatility
  • Handles player inactivity gracefully
  • Fast convergence to accurate ratings
  • Provides comprehensive player profiles
  • Theoretically sound

Weaknesses:

  • Significantly more complex than Elo or Glicko-1
  • Higher computational requirements
  • Parameter tuning can be challenging
  • Not designed for team competitions
  • Less intuitive for non-technical users

Best for: Serious individual competitions where rating accuracy is critical and you have the computational resources to handle the complexity.
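
One source of Glicko-2's complexity is that it operates on an internal scale: ratings and RDs are divided by the constant 173.7178 from Glickman's specification before the volatility update, then converted back. A quick sketch of just that conversion:

```python
# Sketch of Glicko-2's internal scale conversion (per Glickman's paper):
# mu = (r - 1500) / 173.7178, phi = RD / 173.7178.

GLICKO2_SCALE = 173.7178

def to_glicko2(rating, rd):
    return (rating - 1500) / GLICKO2_SCALE, rd / GLICKO2_SCALE

def from_glicko2(mu, phi):
    return mu * GLICKO2_SCALE + 1500, phi * GLICKO2_SCALE

mu, phi = to_glicko2(1850, 100)
print(round(mu, 4), round(phi, 4))
```

The volatility parameter itself is found by an iterative root-finding step on this internal scale, which is where most of the implementation effort (and the parameter-tuning pain) lives.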

TrueSkill: The Team Player

Strengths:

  • Designed for team competitions
  • Handles any number of teams or players
  • Fast convergence with minimal data
  • Bayesian foundation is mathematically sound
  • Optimized for matchmaking

Weaknesses:

  • Highest computational complexity
  • Most difficult to explain to non-technical users
  • Parameter tuning can be challenging
  • Less widely recognized outside gaming
  • “Black box” feel for many users

Best for: Team-based competitions, especially in gaming contexts, where matchmaking quality is critical.
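
One common way to soften TrueSkill's "black box" feel is to report the conservative score mu − 3·sigma, a value the player's true skill is very likely to exceed. A tiny sketch (the mu/sigma values are illustrative, not Elote output):

```python
# Sketch of TrueSkill's conservative ranking score: mu - 3*sigma.
# TrueSkill's defaults are roughly mu=25, sigma=25/3.

def conservative_rating(mu, sigma):
    return mu - 3 * sigma

# A new player: high mean, but high uncertainty drags the score down.
print(conservative_rating(25.0, 8.0))   # 1.0
# After some games the uncertainty shrinks and the score rises.
print(conservative_rating(27.5, 2.0))   # 21.5
```

This is why a new player in TrueSkill-based leaderboards starts near the bottom despite an average mean: the system is penalizing uncertainty, not skill.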

ECF: The British Pragmatist

Strengths:

  • Simplicity and historical precedent
  • Directly tied to performance
  • Responsive to recent results
  • Calculable by hand (historically)
  • Intuitive for players

Weaknesses:

  • Limited statistical foundation
  • Primarily designed for chess
  • No uncertainty tracking
  • Can suffer from rating drift
  • Less predictive than some alternatives

Best for: Chess competitions, particularly in British contexts, where tradition and simplicity are valued.
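
The historical ECF grading calculation really was simple enough to do by hand: each game scores the opponent's grade plus 50 for a win, minus 50 for a loss, unchanged for a draw, and the new grade is the average. A sketch (omitting the 40-point clamp the ECF applied to opponent grades):

```python
# Sketch of the historical ECF grading calculation: per-game performance
# is opponent's grade +50 (win), -50 (loss), or unchanged (draw);
# the new grade is the average. The 40-point clamp is omitted here.

def ecf_grade(results):
    """results: list of (opponent_grade, score) with score 1, 0.5, or 0."""
    perfs = [opp + (50 if s == 1 else -50 if s == 0 else 0)
             for opp, s in results]
    return sum(perfs) / len(perfs)

games = [(120, 1), (140, 0), (130, 0.5)]
print(ecf_grade(games))  # (170 + 90 + 130) / 3 = 130.0
```

Because the grade is a plain average of recent performances with no prior, it reacts quickly to form but has little statistical machinery behind it, which is both its charm and its weakness.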

DWZ: The Age-Aware System

Strengths:

  • Accounts for player age and development
  • Adapts to different development rates
  • Resistant to rating inflation
  • Considers game count in updates
  • Good for junior competitions

Weaknesses:

  • Requires player age data
  • Primarily designed for chess
  • More complex than Elo
  • Less widely recognized internationally
  • Parameter sensitivity

Best for: Chess competitions with players of varying ages, especially those with strong junior programs.
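
The age awareness comes from DWZ's development coefficient E, which grows with age and rating so that juniors' ratings move faster. A simplified sketch of the idea (the full DWZ formula has additional factors omitted here, and the age brackets below are approximate):

```python
# Simplified sketch of the DWZ update idea: the rating change scales
# with 800 / (E + n), where n is games played and E is a development
# coefficient that grows with age and rating. Juniors (small E) swing
# more. The real DWZ formula has extra factors not shown here.

def dwz_update(rating, expected, actual, n, age):
    # Age component: younger players get a smaller E, hence bigger swings.
    j = 5 if age <= 20 else 10 if age <= 25 else 15
    e = (rating / 1000) ** 4 + j
    return rating + 800 / (e + n) * (actual - expected)

# Same result, same rating: the 18-year-old's rating moves further.
print(dwz_update(1500, 2.5, 4, 5, age=18))
print(dwz_update(1500, 2.5, 4, 5, age=40))
```

The (rating/1000)^4 term also means strong, established players move slowly regardless of age, which is what gives DWZ its resistance to inflation.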

Colley Matrix: The Mathematician’s Choice

Strengths:

  • Mathematically elegant solution
  • Order-independent (game sequence doesn’t matter)
  • Naturally accounts for strength of schedule
  • Bias-free rankings
  • Unique, optimal solution

Weaknesses:

  • Requires solving a system of equations
  • Only considers wins and losses, not margins
  • Static (doesn’t adapt to improvement/decline)
  • Less intuitive for non-technical users
  • Requires complete recalculation for updates

Best for: Complete seasons or tournaments where all results are available at once and strength of schedule is important.
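
The linear-algebra core is compact: build the Colley matrix C and vector b from win/loss records, then solve C r = b. A minimal NumPy sketch (not Elote's implementation):

```python
# Minimal Colley Matrix solve: C[i][i] = 2 + games played by i,
# C[i][j] = -(games between i and j), b[i] = 1 + (wins_i - losses_i)/2.
# Ratings are the unique solution of C r = b.

import numpy as np

def colley_ratings(n, games):
    """games: list of (winner_index, loser_index)."""
    C = 2 * np.eye(n)
    b = np.ones(n)
    for w, l in games:
        C[w, w] += 1
        C[l, l] += 1
        C[w, l] -= 1
        C[l, w] -= 1
        b[w] += 0.5
        b[l] -= 0.5
    return np.linalg.solve(C, b)

# Three players: 0 beats 1, 1 beats 2 -> ratings order 0 > 1 > 2.
print(colley_ratings(3, [(0, 1), (1, 2)]))
```

Note that player 0 is ranked above player 2 without ever playing them; the win is transmitted through the shared opponent, which is the strength-of-schedule effect in action. The ratings also always average 0.5 by construction.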

Ensemble: The Best of All Worlds

Strengths:

  • Can combine strengths of multiple systems
  • Often more accurate than any single system
  • Flexible and customizable
  • More robust to edge cases
  • Can indicate uncertainty through disagreement

Weaknesses:

  • Highest complexity and parameter count
  • Requires running multiple systems
  • Finding optimal weights can be challenging
  • Less transparent than single systems
  • Potential for overfitting

Best for: Applications where maximum accuracy is critical and you have the resources to implement and tune multiple systems.
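
The basic mechanics of blending are straightforward: normalize each system's ratings onto a common scale, then take a weighted combination. A sketch of the idea (Elote's BlendedCompetitor handles this internally; the names and values here are illustrative):

```python
# Sketch of a simple ensemble: convert each system's ratings to
# z-scores so the scales are comparable, then weight and sum.

from statistics import mean, stdev

def zscores(ratings):
    mu, sd = mean(ratings.values()), stdev(ratings.values())
    return {k: (v - mu) / sd for k, v in ratings.items()}

def blend(systems, weights):
    """systems: {name: {player: rating}}; weights: {name: weight}."""
    normed = {name: zscores(r) for name, r in systems.items()}
    players = next(iter(systems.values())).keys()
    return {p: sum(weights[s] * normed[s][p] for s in systems)
            for p in players}

elo = {"A": 1600, "B": 1500, "C": 1400}
glicko = {"A": 1580, "B": 1520, "C": 1380}
print(blend({"elo": elo, "glicko": glicko}, {"elo": 0.5, "glicko": 0.5}))
```

The normalization step matters: Elo, ECF, and TrueSkill live on wildly different scales, so averaging raw ratings would let one system dominate by accident. Disagreement between the normalized scores is also a usable uncertainty signal.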

Benchmarking the Systems: A Practical Comparison

Let’s put these systems to the test with a simple benchmark. We’ll create a synthetic dataset with known “true” skills, run competitions, and see how well each system recovers the true rankings:

from elote import (
    EloCompetitor, GlickoCompetitor, Glicko2Competitor,
    TrueSkillCompetitor, ECFCompetitor, DWZCompetitor,
    ColleyMatrixCompetitor, BlendedCompetitor
)
import random
from scipy.stats import kendalltau

# Create synthetic competitors with known "true" skills
n_competitors = 50
true_skills = {f"Player_{i}": 1500 + i*10 for i in range(n_competitors)}

# Function to generate matchups with some randomness
def generate_matchups(skills, n_matches=1000):
    matchups = []
    for _ in range(n_matches):
        # Select two random players
        a, b = random.sample(list(skills.keys()), 2)
        
        # Determine winner based on skill difference with some randomness
        skill_diff = skills[a] - skills[b]
        p_a_wins = 1 / (1 + 10 ** (-skill_diff / 400))
        result = random.random() < p_a_wins
        
        if result:
            matchups.append((a, b))
        else:
            matchups.append((b, a))
    
    return matchups

# Generate matchups
matchups = generate_matchups(true_skills, n_matches=5000)

# Function to run a rating system and evaluate its performance
def evaluate_system(competitor_class, matchups, true_skills, **kwargs):
    # Initialize competitors
    competitors = {name: competitor_class(**kwargs) for name in true_skills}
    
    # Process matchups
    for winner, loser in matchups:
        competitors[winner].beat(competitors[loser])

    # Get final ratings
    ratings = {name: comp.rating for name, comp in competitors.items()}
    
    # Kendall Tau rank correlation between true skills and system ratings
    players = list(true_skills.keys())
    tau, _ = kendalltau(
        [true_skills[p] for p in players],
        [ratings[p] for p in players]
    )
    
    return tau, ratings

# Evaluate each system
results = {}

# Elo
tau, ratings = evaluate_system(EloCompetitor, matchups, true_skills, initial_rating=1500)
results["Elo"] = (tau, ratings)

# Glicko-1
tau, ratings = evaluate_system(GlickoCompetitor, matchups, true_skills, initial_rating=1500, initial_rd=350)
results["Glicko-1"] = (tau, ratings)

# Glicko-2
tau, ratings = evaluate_system(Glicko2Competitor, matchups, true_skills)
results["Glicko-2"] = (tau, ratings)

# TrueSkill
tau, ratings = evaluate_system(TrueSkillCompetitor, matchups, true_skills)
results["TrueSkill"] = (tau, ratings)

# ECF
tau, ratings = evaluate_system(ECFCompetitor, matchups, true_skills, initial_rating=100)
results["ECF"] = (tau, ratings)

# DWZ
tau, ratings = evaluate_system(DWZCompetitor, matchups, true_skills, initial_rating=1500)
results["DWZ"] = (tau, ratings)

# Colley Matrix
tau, ratings = evaluate_system(ColleyMatrixCompetitor, matchups, true_skills)
results["Colley Matrix"] = (tau, ratings)

# Display results
print("Ranking Accuracy (Kendall Tau correlation with true skills):")
for system, (tau, _) in sorted(results.items(), key=lambda x: x[1][0], reverse=True):
    print(f"{system}: {tau:.4f}")

This benchmark gives us a quantitative measure of how well each system recovers the true rankings from noisy match results.

Choosing the Right System: Decision Factors

When selecting a rating system for your application, consider these key factors:

  1. Competition Type: Individual vs. team competitions
  2. Data Volume: How many competitions/matchups you expect
  3. Computational Resources: What level of complexity you can support
  4. User Understanding: How important it is for users to understand the system
  5. Special Requirements: Age awareness, uncertainty tracking, etc.
  6. Update Frequency: Real-time updates vs. batch processing
  7. Historical Precedent: Industry standards in your domain

Here’s a decision tree to help guide your choice:

Is it a team competition?
├── Yes → Do you need Bayesian uncertainty?
│         ├── Yes → TrueSkill
│         └── No → Modified Elo for teams
└── No → Is uncertainty tracking important?
          ├── Yes → Do you need to track volatility?
          │         ├── Yes → Glicko-2
          │         └── No → Glicko-1
          └── No → Do you have special requirements?
                    ├── Age awareness → DWZ
                    ├── Batch processing → Colley Matrix
                    ├── Historical precedent (chess) → ECF
                    ├── Maximum accuracy → Ensemble
                    └── None of the above → Elo
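
That tree translates almost directly into code. A hypothetical helper (all names and flags are illustrative, not part of Elote):

```python
# The decision tree above, sketched as a function. The flag and
# option names are made up for illustration.

def pick_system(team=False, bayesian=False, uncertainty=False,
                volatility=False, special=None):
    if team:
        return "TrueSkill" if bayesian else "Modified Elo for teams"
    if uncertainty:
        return "Glicko-2" if volatility else "Glicko-1"
    return {"age": "DWZ", "batch": "Colley Matrix",
            "chess_tradition": "ECF", "accuracy": "Ensemble"}.get(special, "Elo")

print(pick_system(team=True, bayesian=True))           # TrueSkill
print(pick_system(uncertainty=True, volatility=True))  # Glicko-2
print(pick_system())                                   # Elo
```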

Conclusion: No One-Size-Fits-All Solution

After exploring eight different rating systems, one thing is clear: there’s no perfect system for all situations. Each approach has its strengths and weaknesses, and the best choice depends on your specific requirements.

For simple applications where transparency and ease of implementation are paramount, Elo remains hard to beat. For more complex scenarios involving teams, uncertainty, or special requirements, the more sophisticated systems offer valuable advantages.

And for those who can’t decide or need maximum accuracy, the Ensemble approach lets you combine the strengths of multiple systems.

The beauty of Elote is that it makes all these options available through a consistent API, allowing you to experiment with different systems and find the one that works best for your specific use case.

I hope this series has given you a deeper understanding of rating systems and how they can be applied to a wide range of competitive scenarios. Whether you’re ranking chess players, sports teams, or tacos in Atlanta, there’s a rating system that’s right for your needs.

Thank you for joining me on this journey through the fascinating world of competitive ratings!