Rating Systems Compared: Choosing the Right Algorithm
Over the past few months, we’ve taken a deep dive into the world of rating systems through the lens of Elote, my Python library for implementing and comparing competitive rating algorithms. We’ve explored eight different systems:
- Elo - The classic chess rating system
- Glicko-1 - Elo with uncertainty tracking
- Glicko-2 - Adding volatility to the equation
- TrueSkill - Microsoft’s Bayesian team rating system
- ECF - The British approach to chess ratings
- DWZ - German engineering for chess ratings
- Colley Matrix - A linear algebra approach
- Ensemble - Combining multiple systems
Now it’s time to bring it all together and compare these systems side by side to help you choose the right one for your specific needs.
Comparing the Systems: A Feature Matrix
Let’s start with a high-level comparison of the key features of each system:
| System | Uncertainty Tracking | Team Support | Volatility | Age Awareness | Computational Complexity | Mathematical Foundation |
|---|---|---|---|---|---|---|
| Elo | ❌ | ❌ | ❌ | ❌ | Low | Logistic probability |
| Glicko-1 | ✅ | ❌ | ❌ | ❌ | Medium | Logistic probability |
| Glicko-2 | ✅ | ❌ | ✅ | ❌ | High | Logistic probability |
| TrueSkill | ✅ | ✅ | ❌ | ❌ | Very High | Bayesian inference |
| ECF | ❌ | ❌ | ❌ | ❌ | Low | Performance-based |
| DWZ | ❌ | ❌ | ❌ | ✅ | Medium | Modified Elo |
| Colley Matrix | ❌ | ❌ | ❌ | ❌ | Medium | Linear algebra |
| Ensemble | Depends | Depends | Depends | Depends | High | Multiple |
This table gives a quick overview, but let’s dig deeper into the strengths and weaknesses of each system.
Elo: The Reliable Classic
Strengths:
- Simplicity and intuitive understanding
- Minimal computational requirements
- Widely recognized and accepted
- Proven track record over decades
- Easy to implement and maintain
Weaknesses:
- No uncertainty tracking
- Slow to converge for new players
- Not designed for team competitions
- Can suffer from rating inflation/deflation
- Assumes stable skill levels
Best for: Simple one-on-one competitions where computational resources are limited and transparency is important.
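To make the "simplicity" claim concrete, the entire Elo update fits in a few lines. This is a standalone sketch of the standard logistic formula with a K-factor of 32, not Elote's internal implementation:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score for player A against player B (logistic curve, base 10, scale 400)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Return both players' updated ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# A 1500-rated player upsets a 1700-rated player: the underdog gains
# more than K/2 points because the result was unexpected.
new_a, new_b = elo_update(1500, 1700, 1.0)
```

Note that the update is zero-sum: whatever one player gains, the other loses, which is also why Elo pools can slowly inflate or deflate as players enter and leave.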
Glicko-1: Elo with Confidence Intervals
Strengths:
- Tracks rating reliability
- Handles player inactivity gracefully
- Faster convergence than Elo
- Still relatively simple to understand
- Provides confidence intervals
Weaknesses:
- More complex than Elo
- Not designed for team competitions
- Doesn’t track performance volatility
- Requires more computational resources
- Less widely recognized than Elo
Best for: Individual competitions where players may have varying activity levels and you want to express confidence in ratings.
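The two ideas Glicko-1 adds on top of Elo, confidence intervals and inactivity handling, are easy to sketch. The constant `c = 34.6` below is the illustrative value from Glickman's original paper (chosen so that uncertainty returns to its maximum after about 100 idle rating periods); a real deployment would tune it:

```python
import math

def glicko_interval(rating: float, rd: float) -> tuple[float, float]:
    """Approximate 95% confidence interval implied by a Glicko rating deviation (RD)."""
    return rating - 1.96 * rd, rating + 1.96 * rd

def inflate_rd(rd: float, periods_inactive: int, c: float = 34.6, max_rd: float = 350.0) -> float:
    """Glicko's inactivity rule: RD grows with idle time, capped at the initial RD of 350."""
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_inactive), max_rd)

low, high = glicko_interval(1850, 50)          # an established player: a tight interval
rusty_rd = inflate_rd(50, periods_inactive=10) # uncertainty grows while inactive
```

This is the property the "handles inactivity gracefully" bullet refers to: a returning player's rating is unchanged, but the system admits it is less sure of it.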
Glicko-2: The Complete Package
Strengths:
- Tracks both reliability and volatility
- Handles player inactivity gracefully
- Fast convergence to accurate ratings
- Provides comprehensive player profiles
- Theoretically sound
Weaknesses:
- Significantly more complex than Elo or Glicko-1
- Higher computational requirements
- Parameter tuning can be challenging
- Not designed for team competitions
- Less intuitive for non-technical users
Best for: Serious individual competitions where rating accuracy is critical and you have the computational resources to handle the complexity.
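Volatility is the piece Glicko-2 adds: a per-player estimate of how erratic their results are, which widens the rating deviation before every rating period. The sketch below shows only that widening step on Glicko-2's internal scale; the full volatility update (an iterative root-finding procedure) is omitted, which is also where most of the system's complexity lives:

```python
import math

# Glicko-2 works on an internal scale: rating 1500 maps to mu = 0,
# and RDs are divided by the constant 173.7178.
def to_glicko2_scale(rating: float, rd: float) -> tuple[float, float]:
    return (rating - 1500) / 173.7178, rd / 173.7178

def pre_period_phi(phi: float, sigma: float) -> float:
    """Before each rating period, the volatility sigma widens the deviation phi."""
    return math.sqrt(phi ** 2 + sigma ** 2)

mu, phi = to_glicko2_scale(1500, 200)
widened = pre_period_phi(phi, sigma=0.06)  # 0.06 is a typical starting volatility
```

A player with erratic results accumulates a larger sigma, so their rating moves faster; a steady player's sigma shrinks and their rating stabilizes.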
TrueSkill: The Team Player
Strengths:
- Designed for team competitions
- Handles any number of teams or players
- Fast convergence with minimal data
- Bayesian foundation is mathematically sound
- Optimized for matchmaking
Weaknesses:
- Highest computational complexity
- Most difficult to explain to non-technical users
- Parameter tuning can be challenging
- Less widely recognized outside gaming
- “Black box” feel for many users
Best for: Team-based competitions, especially in gaming contexts, where matchmaking quality is critical.
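Two TrueSkill ideas are worth seeing in miniature: the conservative leaderboard score (mean skill minus three standard deviations, so unproven players rank low until the system is confident) and the 1v1 match-quality formula used for matchmaking. This is a standalone sketch using TrueSkill's published defaults (mu = 25, sigma = 25/3, beta = 25/6), not Elote's implementation:

```python
import math

def conservative_skill(mu: float, sigma: float, k: float = 3.0) -> float:
    """Leaderboard score: mean skill minus k standard deviations of uncertainty."""
    return mu - k * sigma

def match_quality(mu_a: float, sigma_a: float, mu_b: float, sigma_b: float,
                  beta: float = 25 / 6) -> float:
    """Approximate draw probability for a 1v1 match; higher means a closer, better match."""
    denom = 2 * beta ** 2 + sigma_a ** 2 + sigma_b ** 2
    return math.sqrt(2 * beta ** 2 / denom) * math.exp(-((mu_a - mu_b) ** 2) / (2 * denom))

# A brand-new player scores 0 on the leaderboard despite the same mean skill
newcomer = conservative_skill(25.0, 25 / 3)
veteran = conservative_skill(25.0, 2.0)
```

The matchmaker's job is then simply to pair players so that `match_quality` is as high as possible.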
ECF: The British Pragmatist
Strengths:
- Simplicity and historical precedent
- Directly tied to performance
- Responsive to recent results
- Calculable by hand (historically)
- Intuitive for players
Weaknesses:
- Limited statistical foundation
- Primarily designed for chess
- No uncertainty tracking
- Can suffer from rating drift
- Less predictive than some alternatives
Best for: Chess competitions, particularly in British contexts, where tradition and simplicity are valued.
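The "calculable by hand" claim is literal. Under the traditional ECF grading rules (the pre-2020 three-digit scale), each game yields a performance of the opponent's grade plus 50 for a win, minus 50 for a loss, and unchanged for a draw, with the opponent's grade first clamped to within 40 points of yours; the new grade is the average over the period. A sketch under those assumptions:

```python
def ecf_performance(my_grade: float, opp_grade: float, score: float) -> float:
    """Per-game performance under the traditional ECF grading rules.

    The opponent's grade is clamped to within 40 points of yours, then you
    score that grade +50 for a win, -50 for a loss, unchanged for a draw.
    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    """
    clamped = max(my_grade - 40, min(my_grade + 40, opp_grade))
    return clamped + 50 * (2 * score - 1)

# Three games for a 130-graded player: beat a 120, drew a 150, lost to a 200
games = [(120, 1.0), (150, 0.5), (200, 0.0)]
new_grade = sum(ecf_performance(130, g, s) for g, s in games) / len(games)
```

The clamping is why ECF "can suffer from rating drift": beating a vastly weaker opponent still scores as if they were only 40 points below you.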
DWZ: The Age-Aware System
Strengths:
- Accounts for player age and development
- Adapts to different development rates
- Resistant to rating inflation
- Considers game count in updates
- Good for junior competitions
Weaknesses:
- Requires player age data
- Primarily designed for chess
- More complex than Elo
- Less widely recognized internationally
- Parameter sensitivity
Best for: Chess competitions with players of varying ages, especially those with strong junior programs.
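The age-awareness enters DWZ through the development coefficient E, which divides the update: younger players get a smaller E and therefore larger, faster rating adjustments. The sketch below keeps only the core of the published formula (the full rules add several correction terms) and the Elo-style expected score it shares with FIDE ratings; treat the exact constants as illustrative:

```python
def dwz_update(rating: float, opp_ratings: list[float], score: float, age: int) -> float:
    """Simplified DWZ update with an age-dependent development coefficient E.

    J is 5 for juniors (age <= 20), 10 for ages 21-25, 15 otherwise, so a
    junior's rating responds faster to the same results. score is total points.
    """
    # Expected total score, summed over opponents using the Elo logistic curve
    expected = sum(1 / (1 + 10 ** ((opp - rating) / 400)) for opp in opp_ratings)
    j = 5 if age <= 20 else 10 if age <= 25 else 15
    e = (rating / 1000) ** 4 + j  # development coefficient (core term only)
    return rating + 800 / (e + len(opp_ratings)) * (score - expected)

# Same results, same rating: the 16-year-old moves further than the 40-year-old
junior = dwz_update(1500, [1500, 1500], score=2.0, age=16)
senior = dwz_update(1500, [1500, 1500], score=2.0, age=40)
```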
Colley Matrix: The Mathematician’s Choice
Strengths:
- Mathematically elegant solution
- Order-independent (game sequence doesn’t matter)
- Naturally accounts for strength of schedule
- Bias-free rankings
- Unique, optimal solution
Weaknesses:
- Requires solving a system of equations
- Only considers wins and losses, not margins
- Static (doesn’t adapt to improvement/decline)
- Less intuitive for non-technical users
- Requires complete recalculation for updates
Best for: Complete seasons or tournaments where all results are available at once and strength of schedule is important.
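Because Colley's method is "just" a linear system, it is worth writing out. For n players, solve C r = b where C starts as 2I (a Laplace prior), each game adds 1 to both players' diagonal entries and subtracts 1 from their off-diagonal pair, and b_i = 1 + (wins_i - losses_i)/2. A minimal sketch with NumPy:

```python
import numpy as np

def colley_ratings(n_players: int, games: list[tuple[int, int]]) -> np.ndarray:
    """Solve the Colley system C r = b for a list of (winner_index, loser_index) games.

    C is symmetric positive definite, so the solution exists and is unique;
    ratings center on 0.5 and never depend on the order of the games.
    """
    c = 2 * np.eye(n_players)  # C_ii starts at 2 (the Laplace "two phantom draws" prior)
    b = np.ones(n_players)     # b_i = 1 + (wins_i - losses_i) / 2
    for w, l in games:
        c[w, w] += 1
        c[l, l] += 1
        c[w, l] -= 1
        c[l, w] -= 1
        b[w] += 0.5
        b[l] -= 0.5
    return np.linalg.solve(c, b)

# Player 0 beats 1, and 1 beats 2: the transitive order is recovered
r = colley_ratings(3, [(0, 1), (1, 2)])
```

Note how player 1 lands exactly at 0.5 despite a 1-1 record: the win came against a weaker schedule than the loss, which is the strength-of-schedule adjustment working as advertised. The downside is equally visible: one new game changes C, so everything must be re-solved.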
Ensemble: The Best of All Worlds
Strengths:
- Can combine strengths of multiple systems
- Often more accurate than any single system
- Flexible and customizable
- More robust to edge cases
- Can indicate uncertainty through disagreement
Weaknesses:
- Highest complexity and parameter count
- Requires running multiple systems
- Finding optimal weights can be challenging
- Less transparent than single systems
- Potential for overfitting
Best for: Applications where maximum accuracy is critical and you have the resources to implement and tune multiple systems.
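One practical wrinkle in ensembles is that the component systems live on different scales (Elo around 1500, ECF around 100), so raw averaging is meaningless. A common strategy, sketched below with hypothetical inputs and not necessarily how Elote's BlendedCompetitor combines systems, is to z-score-normalize each system's ratings before taking a weighted average:

```python
import statistics

def blend_ratings(system_ratings: dict[str, dict[str, float]],
                  weights: dict[str, float]) -> dict[str, float]:
    """Weighted z-score blend of several rating systems' outputs.

    Each system's ratings are standardized (mean 0, stdev 1) so that
    differently scaled systems contribute comparably, then combined
    with per-system weights.
    """
    players = next(iter(system_ratings.values())).keys()
    blended = {p: 0.0 for p in players}
    total_w = sum(weights.values())
    for system, ratings in system_ratings.items():
        mean = statistics.mean(ratings.values())
        stdev = statistics.pstdev(ratings.values()) or 1.0  # guard against zero spread
        for p in players:
            blended[p] += weights[system] * (ratings[p] - mean) / stdev
    return {p: v / total_w for p, v in blended.items()}

combined = blend_ratings(
    {"elo": {"A": 1600, "B": 1500, "C": 1400},
     "glicko": {"A": 1700, "B": 1450, "C": 1480}},
    weights={"elo": 0.5, "glicko": 0.5},
)
```

Disagreement between the systems (here, Glicko ranks C above B while Elo says the opposite) is exactly the uncertainty signal the strengths list mentions.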
Benchmarking the Systems: A Practical Comparison
Let’s put these systems to the test with a simple benchmark. We’ll create a synthetic dataset with known “true” skills, run competitions, and see how well each system recovers the true rankings:
```python
from elote import (
    EloCompetitor, GlickoCompetitor, Glicko2Competitor,
    TrueSkillCompetitor, ECFCompetitor, DWZCompetitor,
    ColleyMatrixCompetitor, BlendedCompetitor
)
import random
from scipy.stats import kendalltau  # note: kendalltau lives in scipy.stats, not sklearn

random.seed(42)  # make the benchmark reproducible

# Create synthetic competitors with known "true" skills
n_competitors = 50
true_skills = {f"Player_{i}": 1500 + i * 10 for i in range(n_competitors)}

def generate_matchups(skills, n_matches=1000):
    """Generate (winner, loser) pairs whose outcomes follow the Elo win-probability curve."""
    matchups = []
    for _ in range(n_matches):
        # Select two distinct random players
        a, b = random.sample(list(skills.keys()), 2)
        # Determine the winner from the skill difference, with some randomness
        skill_diff = skills[a] - skills[b]
        p_a_wins = 1 / (1 + 10 ** (-skill_diff / 400))
        if random.random() < p_a_wins:
            matchups.append((a, b))
        else:
            matchups.append((b, a))
    return matchups

matchups = generate_matchups(true_skills, n_matches=5000)

def evaluate_system(competitor_class, matchups, true_skills, **kwargs):
    """Run one rating system over the matchups and score how well it recovers the true ranking."""
    competitors = {name: competitor_class(**kwargs) for name in true_skills}
    for winner, loser in matchups:
        competitors[winner].beat(competitors[loser])
    ratings = {name: comp.rating for name, comp in competitors.items()}
    # Kendall tau rank correlation between true skills and recovered ratings
    players = list(true_skills.keys())
    tau, _ = kendalltau(
        [true_skills[p] for p in players],
        [ratings[p] for p in players],
    )
    return tau, ratings

# Evaluate each system (the Ensemble is omitted here, since it first
# requires configuring the systems it blends)
results = {
    "Elo": evaluate_system(EloCompetitor, matchups, true_skills, initial_rating=1500),
    "Glicko-1": evaluate_system(GlickoCompetitor, matchups, true_skills,
                                initial_rating=1500, initial_rd=350),
    "Glicko-2": evaluate_system(Glicko2Competitor, matchups, true_skills),
    "TrueSkill": evaluate_system(TrueSkillCompetitor, matchups, true_skills),
    "ECF": evaluate_system(ECFCompetitor, matchups, true_skills, initial_rating=100),
    "DWZ": evaluate_system(DWZCompetitor, matchups, true_skills, initial_rating=1500),
    "Colley Matrix": evaluate_system(ColleyMatrixCompetitor, matchups, true_skills),
}

# Display results, best first
print("Ranking Accuracy (Kendall tau correlation with true skills):")
for system, (tau, _) in sorted(results.items(), key=lambda x: x[1][0], reverse=True):
    print(f"{system}: {tau:.4f}")
```
This benchmark gives us a quantitative measure of how well each system recovers the true rankings from noisy match results.
Choosing the Right System: Decision Factors
When selecting a rating system for your application, consider these key factors:
- Competition Type: Individual vs. team competitions
- Data Volume: How many competitions/matchups you expect
- Computational Resources: What level of complexity you can support
- User Understanding: How important is it for users to understand the system
- Special Requirements: Age awareness, uncertainty tracking, etc.
- Update Frequency: Real-time updates vs. batch processing
- Historical Precedent: Industry standards in your domain
Here’s a decision tree to help guide your choice:
```
Is it a team competition?
├── Yes → Do you need Bayesian uncertainty?
│   ├── Yes → TrueSkill
│   └── No → Modified Elo for teams
└── No → Is uncertainty tracking important?
    ├── Yes → Do you need to track volatility?
    │   ├── Yes → Glicko-2
    │   └── No → Glicko-1
    └── No → Do you have special requirements?
        ├── Age awareness → DWZ
        ├── Batch processing → Colley Matrix
        ├── Historical precedent (chess) → ECF
        ├── Maximum accuracy → Ensemble
        └── None of the above → Elo
```
Conclusion: No One-Size-Fits-All Solution
After exploring eight different rating systems, one thing is clear: there’s no perfect system for all situations. Each approach has its strengths and weaknesses, and the best choice depends on your specific requirements.
For simple applications where transparency and ease of implementation are paramount, Elo remains hard to beat. For more complex scenarios involving teams, uncertainty, or special requirements, the more sophisticated systems offer valuable advantages.
And for those who can’t decide or need maximum accuracy, the Ensemble approach lets you combine the strengths of multiple systems.
The beauty of Elote is that it makes all these options available through a consistent API, allowing you to experiment with different systems and find the one that works best for your specific use case.
I hope this series has given you a deeper understanding of rating systems and how they can be applied to a wide range of competitive scenarios. Whether you’re ranking chess players, sports teams, or tacos in Atlanta, there’s a rating system that’s right for your needs.
Thank you for joining me on this journey through the fascinating world of competitive ratings!