QELO ratings for MIMIR quizzes
This post is for my MIMIR quiz friends. It may not make any sense to others.
I’ve done some analysis on MIMIR scoring across leagues and come up with an Elo-like rating system, QELO, that captures full relative performance in a given MIMIR match, as opposed to the current systems, which capture partial relative performance using arbitrary rules.
My assumptions are
- The distilled relative performance data (QELO score) contained in the numbers from one instance of a given set of questions (a MIMIR game or a match) can be reliably compared across other instances of the same set.
- QELO scores can be compared across weeks and averaged or aggregated as required to determine league standings even if games are randomized throughout the duration of the league. Basically, everyone in a randomized league will have an equal chance of playing against Pat Gibson without affecting either their general trajectory or his.
- QELO scores can be compared across leagues with a simple weightage for the toughness of a given league, which in turn can come from the distribution of QELO scores in the league.
The Elo rating system, the inspiration for QELO, is a method for calculating the relative skill levels of players in zero-sum games such as chess.
MIMIR quiz leagues around the world utilise arbitrary scoring systems to capture relative skill levels. For example, one league gives win points to the top 2 players in a four player game. Another league awards win points to players who are within 3 points of the winner and so on.
The issues with these scoring systems are that
- They do not capture the full performance of those who do not make the arbitrary cut.
- They could amplify the performance of those who make their arbitrary cut.
- These arbitrary rules, when applied in a local league over 6–8 weeks, can produce some weird aggregate rankings, where competitors who ace the overarching parameter don’t do too well on other important parameters.
To see how this plays out, arrange players in order of Win Points (one arbitrary parameter) and then in order of aggregate scores (another, less arbitrary parameter that many people believe is a better indicator of performance).
There is a significant deviation between these two lists, especially as you move down the tiers. There are situations where players lose many spots one way or another.
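The comparison above can be sketched in a few lines of Python. This is only an illustration: the function names and the three-player totals are made up, and ties are broken by input order, which a real league would probably want to handle differently.

```python
def rank_positions(totals):
    """Map each player to a 1-based rank, highest total first."""
    ordered = sorted(totals, key=totals.get, reverse=True)
    return {player: pos for pos, player in enumerate(ordered, start=1)}


def rank_shifts(win_points, aggregate_scores):
    """Rank by win points minus rank by aggregate score, per player.

    A positive number means the player sits higher on the aggregate-score
    list than on the win-points list.
    """
    by_wins = rank_positions(win_points)
    by_aggregate = rank_positions(aggregate_scores)
    return {p: by_wins[p] - by_aggregate[p] for p in win_points}


# Hypothetical season totals, for illustration only.
win_points = {"P1": 10, "P2": 8, "P3": 6}
aggregate_scores = {"P1": 150, "P2": 120, "P3": 140}
shifts = rank_shifts(win_points, aggregate_scores)
```

Sorting players by the absolute value of the shift gives a quick list of who is most affected by the choice of parameter.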
What would a more comprehensive scoring system, given the inbuilt, unavoidable, randomness in the MIMIR system, look like?
Here is my hypothesis: There are a few major factors that should determine the QELO score of a player in a MIMIR league.
- Raw Score: Owns + Bonus points.
- Competitiveness: (From 0 to 1) of the game/match. How close were the scores? Competitiveness Score = 0.5 * (2nd Score / 1st Score) + 0.3 * (3rd Score / 2nd Score) + 0.2 * (4th Score / 3rd Score). I’ve chosen coefficients of 0.5, 0.3 and 0.2 to weight the competitiveness of the game based on all three ratios in order of their significance. The ratio of the last two scores is important at a 0.2 weightage, but not as important as the ratio of the first two scores at 0.5.
- Completeness: (From 0 to 1) of the game/match. How many of the total points available were taken? A match that has 40/64 questions answered gets a lower weightage on this parameter compared to a match that has 58/64 questions answered.
The QELO score looks like this
Score = Raw score * Competitiveness * Completeness * 100
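Here is a minimal Python sketch of the formula for a four player game. The function names are my own, completeness is assumed to be points taken divided by points available, and a ratio with a zero denominator is treated as 0. That last choice is an assumption, not something the model specifies.

```python
def competitiveness(scores, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of adjacent score ratios, highest score first."""
    ranked = sorted(scores, reverse=True)
    ratios = [
        (lower / higher) if higher > 0 else 0.0  # assumption: 0 when undefined
        for higher, lower in zip(ranked, ranked[1:])
    ]
    return sum(w * r for w, r in zip(weights, ratios))


def completeness(points_taken, points_available):
    """Share of the available points that were actually taken (0 to 1)."""
    return points_taken / points_available


def qelo_scores(raw_scores, points_taken, points_available):
    """Score = Raw score * Competitiveness * Completeness * 100, per player."""
    comp = competitiveness(list(raw_scores.values()))
    full = completeness(points_taken, points_available)
    return {player: raw * comp * full * 100 for player, raw in raw_scores.items()}


# Made-up game: raw scores of 20, 16, 12 and 6, with 58 of 64 points taken.
example = qelo_scores({"A": 20, "B": 16, "C": 12, "D": 6}, 58, 64)
```

In this made-up game the ratios are 0.8, 0.75 and 0.5, so competitiveness is 0.5 * 0.8 + 0.3 * 0.75 + 0.2 * 0.5 = 0.725, and every player's raw score is scaled by the same game-level factor.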
What are the advantages of a QELO score?
- QELO scores capture full relative performance in any given game, as opposed to arbitrary cutoffs that could result in loss of significant information as the weeks in a league move on.
- Any random collection of 4 quizzers across hundreds of games of the same set of questions will still result in a reliable ranking.
- If multiple leagues agree on the weightage coefficients, QELO scores can be compared across leagues.
What about 3 player games and 2 player games?
Coefficient adjustments can tackle these situations with some data loss. This needs to be tested on data from different leagues.
3 player games can use competitiveness coefficients of 0.7 for 2nd Score/1st Score ratio and 0.3 for 3rd Score/2nd Score ratio. Thus the coefficients add up to 1 and do not make assumptions about what the fourth player could have added to the mix.
2 player games can have a single competitiveness coefficient of 1.0 for 2nd score/1st score ratio.
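A rough Python sketch of picking the coefficients by number of players follows. The lookup table and function name are illustrative, and treating a ratio with a zero denominator as 0 is my assumption.

```python
# Competitiveness coefficients keyed by number of players in the game.
WEIGHTS_BY_PLAYERS = {
    4: (0.5, 0.3, 0.2),
    3: (0.7, 0.3),
    2: (1.0,),
}


def competitiveness(scores):
    """Weighted sum of adjacent score ratios for 2, 3 or 4 player games."""
    ranked = sorted(scores, reverse=True)
    weights = WEIGHTS_BY_PLAYERS[len(ranked)]  # KeyError for other sizes
    ratios = [
        (lower / higher) if higher > 0 else 0.0  # assumption: 0 when undefined
        for higher, lower in zip(ranked, ranked[1:])
    ]
    return sum(w * r for w, r in zip(weights, ratios))
```

With made-up scores of 20, 16 and 12, a three player game gets 0.7 * 0.8 + 0.3 * 0.75 = 0.785, and a two player game that ends 20 to 10 gets 0.5.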
Alternatively, a derived number from the week’s match data can be used for the 4th Score/3rd Score ratio and the 3rd Score/2nd Score ratio as required, while keeping the competitiveness coefficients constant at 0.5 (2/1), 0.3 (3/2) and 0.2 (4/3). For example, the <xth percentile> 4th Score/3rd Score ratio in the aggregated match data for the week could be used in place of the missing ratio in games where there was no 4th player.
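The fill-in approach could look something like the Python sketch below, using the 0.5, 0.3 and 0.2 coefficients from the competitiveness formula. The percentile is deliberately left as a required argument because the model does not fix it yet. The helper names, the linear-interpolation percentile and the zero-denominator rule are all assumptions.

```python
WEIGHTS = (0.5, 0.3, 0.2)


def adjacent_ratios(scores):
    """Ratios 2nd/1st, 3rd/2nd, ... with scores ranked highest first."""
    ranked = sorted(scores, reverse=True)
    return [
        (lower / higher) if higher > 0 else 0.0  # assumption: 0 when undefined
        for higher, lower in zip(ranked, ranked[1:])
    ]


def percentile(values, pct):
    """Linearly interpolated percentile, with pct between 0 and 100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no observed ratios to derive a fill-in value from")
    position = (len(ordered) - 1) * pct / 100
    below = int(position)
    above = min(below + 1, len(ordered) - 1)
    return ordered[below] + (ordered[above] - ordered[below]) * (position - below)


def competitiveness_with_fill(scores, week_games, pct):
    """Competitiveness where missing ratios come from the week's other games."""
    ratios = adjacent_ratios(scores)
    for slot in range(len(ratios), len(WEIGHTS)):
        # Only games with enough players have a ratio in this slot.
        observed = [adjacent_ratios(g)[slot] for g in week_games if len(g) > slot + 1]
        ratios.append(percentile(observed, pct))
    return sum(w * r for w, r in zip(WEIGHTS, ratios))
```

For a three player game only the 4th/3rd slot is filled in. For a two player game both the 3rd/2nd and 4th/3rd slots are filled from the week's data.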
Disclaimer: There are numbers used for weighting coefficients in this QELO model. They can be tuned based on a given league’s understanding of the game. Technically, they are arbitrary, but as long as they are used to mirror real life quiz fairness, they will work well.
For example, in calculating the competitiveness score of a given game, I use this formula: Competitiveness Score = 0.5 * (2nd Score / 1st Score) + 0.3 * (3rd Score / 2nd Score) + 0.2 * (4th Score / 3rd Score). I’ve chosen coefficients of 0.5, 0.3 and 0.2 to weight the competitiveness of the game based on all three ratios in order of their significance. The ratio of the last two scores is important at a 0.2 weightage, but not as important as the ratio of the first two scores at 0.5.
Please create a copy of this file and use it on data sets from any quiz league you might want. Please message me your results, thoughts and misgivings, so that I can improve the model.
_________________________________
Notes for further fine tuning:
- X differential. Distribute the X differential (a given player’s Xs minus the lowest Xs in the game) using a formula that takes into account the player’s broad performance in the game, using the percentage of total points scored, and the risk taking in the given game, using the percentage of the maximum possible bonus attempts that were attempted in the game.