Many of you may already be aware of some of the problems with the current way vRating works. To summarize, the main concerns with vRating are the following:
PROBLEMS WITH VRATING
1. Winners in large variants, like Divided States, make jumps in their rating that are too large.
2. Headhunting is encouraged in vRating, i.e. you gain more rating from eliminating a higher rated player than a lower rated player.
3. The way players who take over a country in Civil Disorder gain or lose vRating needs improvement.
These problems have existed for a long time. How should we go about fixing them, though?
If only there was a global pandemic that would cause somebody - preferably a Diplomacy player that recently graduated in Mathematics - to sit bored at home and spend time looking into this.
Well, we are in luck, because that person is me! After some puzzling, I have derived an alternative rating system that works in much the same way as vRating, except that all of the aforementioned problems are solved.
I am not a moderator, though, nor do I have the programming skills to actually implement this. The purpose of this topic is to gain feedback from the community. Oli only wants to change things if there is enough support for it.
Below, I will address each of the three problems and
- explain the problem in detail;
- argue why I think it is definitely a problem that needs solving;
- intuitively explain how my alternative rating system solves this problem.
But first, I will give a crash course in Elo rating because this is critical in understanding the logic behind everything else I will explain.
ELO RATING: HOW DOES IT WORK?
Elo rating was originally invented by Elo to rate players in the game of chess. The logic behind Elo rating is as follows. Every player has a rating that reflects their skill in playing chess. When two players play a match of chess against each other, the player with the higher rating is more likely to win. The winner of the match gets an increase to his rating and the loser gets a decrease, and the magnitude of this increase and decrease is bigger the more unexpected the outcome of the match was. For example, if the winner had a much higher rating than the loser, then the outcome was pretty much expected, so the change to the ratings of the players will be small. However, if the winner had a lower rating than the loser, that was not expected, so the change to the ratings of the players will be bigger. Many people consider this to be a fair system, as you do not lose too much rating from losing against a higher rated opponent.
Above was an intuitive explanation. In Elo rating, this is formalized mathematically. You can skip this paragraph if you wish and still understand most of the rest of this text, but for those interested, here is the mathematical formalization. In Elo rating, a game of chess is modelled as two players each drawing a random number from a normal distribution with the same standard deviation but a different mean. The means are equal to the ratings of the respective players. The winner is whoever draws the higher number. In this model, one can calculate the theoretical probability of either player winning. The amount of rating the loser of the match loses is proportional to the theoretical probability he would win, and the amount of rating the winner of the match gains is proportional to 1 minus the theoretical probability he would win. So if, for instance, the loser has a rating much lower than the winner, he is modelled to draw from a normal distribution with a much lower mean, so his theoretical probability to draw the higher number is small, so he loses only a small amount of rating from his loss. In some Elo-like systems, another probability distribution is used instead of a normal distribution.
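For concreteness, here is a minimal Python sketch of the model just described. The constants SIGMA and K are illustrative assumptions of mine, not values any site actually uses.

```python
import math

SIGMA = 200.0  # assumed standard deviation of each player's draw
K = 32         # assumed scaling constant for rating changes

def win_probability(rating_a, rating_b):
    """P(player A draws a higher number than player B), where both draw
    from normal distributions with means rating_a and rating_b and
    standard deviation SIGMA. The difference of the two draws is normal
    with standard deviation SIGMA * sqrt(2)."""
    z = (rating_a - rating_b) / (SIGMA * math.sqrt(2))
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

def update(winner_rating, loser_rating):
    """The winner gains K * (1 - P(winner would win)); the loser loses
    K * P(loser would win). These two amounts are equal."""
    delta = K * (1 - win_probability(winner_rating, loser_rating))
    return winner_rating + delta, loser_rating - delta
```

So a 1000-rated player who beats a 1200-rated player gains far more than a 1200-rated player who beats a 1000-rated one, and the winner's gain always equals the loser's loss.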
vRating does not work the same as Elo rating. The biggest difference is that it rates players in games with more than 2 players. However, vRating is heavily inspired by Elo rating and has the same kind of logic behind it. In 2-player matches, for instance, vRating is the same as Elo rating except that the random numbers are not drawn from a normal distribution.
Now that we have an intuitive understanding of the logic behind rating systems, let us take a look at the problems with vRating.
PROBLEM 1: TOO LARGE JUMPS FOR WINNERS OF BIG GAMES
So far you may have thought: 'But Mercy! I don't think this is a problem. If you win a, let's say, Divided States game, then you DESERVE to gain a huge amount of rating!' Maybe. But something is definitely off with these large jumps. I will show you.
Consider the following example. Alex, Bob and Charly are all playing a different Divided States game, but with the same settings and against similarly rated opponents. Alex has a vRating of 1000, Bob of 1500, and Charly of 2000. All three of them solo. What happens to their ratings? You may naively think that all of them will gain a lot of points and Charly may be the new #1 rated on the site. You'd be wrong. In all likelihood, Alex, not Charly, will be the new #1. Bob's score will fall below the score of Alex and Charly's score will fall below the score of Bob.
How is this possible? It all follows from the simple way a victory is treated in vRating. A solo in a (WTA) Divided States game counts as a simultaneous win against all other 49 participants, a bit as if you beat each of them in a 2-player match at once. Charly is much more likely to win against anybody than Alex is. For the sake of the example, let's say that the ratings are such that Charly is 5x less likely to lose against any opponent in his Divided States game than Alex is. Then by the logic of vRating, Alex's victory is 5x more unexpected and thus Alex gains 5x as many points from his victory as Charly does. I don't see any problem with this so far; for instance, I think that if Alex and Charly both beat Bob in a 2-player match, it is fair to award Alex 5x the number of points Charly gets, due to Alex being much lower rated than Charly. But in the case of a Divided States solo it gets interesting. Alex gains 5x the number of points Charly does. Let's say for the sake of the example that Charly gains 600 points to his vRating. Then he jumps from a rating of 2000 to a rating of 2600. But meanwhile, Alex is awarded 5 x 600 = 3000 points for the same kind of victory and so he jumps from 1000 points to 4000 points, surpassing Charly (and everyone else on the site) by a ridiculous amount!
Clearly, this is nonsensical. What is the solution?
One word: Iterate. Let me illustrate what I mean by this by providing an example of an iteration of 10. We will award Alex extra rating 10 times in a row, but each time he gets awarded only 1 tenth of what someone of his rating would normally get. This means that the first time, Alex gains 3000 / 10 = 300 points and gets a rating of 1300. The second time, he wins 1 tenth the amount of rating a 1300 player would receive from winning a Divided States match. This is evidently less than 300. Let's say for the sake of argument that it is 200. Then Alex jumps to a rating of 1300 + 200 = 1500. The third time he gets 1 tenth the rating of what a 1500 player would receive from winning a Divided States match... which is even less than 200. After the tenth time, Alex will still have a rating that is really high, and he will probably be among the top rated players, but he will not have surpassed Bob, and neither will Bob have surpassed Charly.
Note that when I say that people get awarded extra rating 10 times in a row, this is only to explain what the computer is calculating. Practically, everyone will receive new points to their vRating immediately after finishing a game. Also, I gave the example of 10x in order to make the explanation easy to follow. In reality, I think the more the better, but computing power is a limiting factor. We may also choose to iterate more the more players are in the match.
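The iteration above can be sketched in a few lines of Python. The function raw_gain is a toy stand-in of my own invention (not the actual vRating formula) for the one-shot gain a player of a given rating would receive from such a victory.

```python
def iterated_gain(rating, raw_gain, steps=10):
    """Award the victory in `steps` installments: at each step the player
    receives 1/steps of the gain a player at their CURRENT rating would
    get, so each installment is smaller than the last."""
    current = rating
    for _ in range(steps):
        current += raw_gain(current) / steps
    return current

def raw_gain(rating):
    """Toy assumption: a 1000-rated winner would get 3000 points in one
    shot, and the one-shot gain halves for every 1000 extra rating."""
    return 3000 * 0.5 ** ((rating - 1000) / 1000)
```

With these toy numbers, the one-shot award would catapult a 1000-rated winner to 4000; the iterated award lands well below that, and a winner who starts with a higher rating still ends with a higher rating.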
The scenarios I outlined here are not far-fetched, by the way. Slypups jumped from a rating of 2177 to a rating of 2775 from winning a WTA Europa Renovatio game. Agnaar jumped from a rating of 1612 to a rating of 3109 from also winning a WTA Europa Renovatio game. It is only a matter of time before a low rated player wins a big match and gets catapulted to far above even the highest rating Agnaar has ever had.
PROBLEM 2: HEADHUNTING
Headhunting is the act of deliberately trying to eliminate high rated players in games in order to gain more rating. Headhunting is the reason many high rated players prefer to play anonymously, or just have left the site. Some of you may be thinking: 'Come on Mercy! What is the problem here? Obviously, eliminating a high rated player is worth more than a low rated player. It is only fair if that is reflected in the ratings. High rated players should just stop whining.' Some others yet may think: 'Exactly! Headhunting is no fun and that is why rating systems are stupid.' I am here to tell you that you are both wrong. vRating really does encourage people to specifically hunt down high rated players. There is no reason, though, why any rating system should encourage any kind of behaviour other than maximizing your own score in any given game. If a rating system encourages behaviour that is any different from this, then that is a flaw of the rating system that needs to be addressed.
At this point, let me make a remark. High rated players who complain about headhunting often do not seem to realize that under vRating, there is an opposite effect, too: if you think you will be eliminated, it is best for your rating to throw your centers to the highest rated player on the board, in the hope that this high rated player will achieve as good a result as possible. The reason behind this is that losing to a high rated player costs you less of your vRating than losing to a low rated player does. In the rest of this discussion, I will focus exclusively on the type of headhunting discussed in the previous paragraph, which incentivises players to eliminate top players in some circumstances. However, all of my arguments can be extended to the 'throw to high rated players' problem, which incentivises players to help top players in some other circumstances. The new rating system I propose solves both issues.
Curiously, I didn't actually have to invent a new rating system to solve these issues. There already exists a rating system that does not have them. It is called GhostRating, it is used on webDiplomacy, and it is actually mathematically far simpler than vRating (though I will not explain the math here). I do not want to adopt GhostRating, and I will later explain why, but I do think GhostRating is an illustrative example of how headhunting is not necessarily encouraged in rating systems.
Speaking about examples, suppose Alex (a different Alex than the one who just won a Divided States game) wants to play a game of Classic with six of his friends from school. They decide to play on vDiplomacy. All of them have beginner's rating, except for Bob (again, a different one), who already has a few games in vDiplomacy under his belt, and he did well in these games. Alex plays well. He and his ally (not Bob) become the dominant alliance, but his ally refuses to try to 2-way draw with Alex, afraid Alex will betray him and solo. As such, they need a third power for balance and plan to end the game in a 3-way draw. (I know they are boring carebears. It is just an example.) Alex needs to make a decision whether he wants this third player to be Bob or not. If Alex eliminates Bob, he gains more rating than if he decides to draw with Bob, even though in both cases Alex ends the game in a 3-way draw. Therefore, Alex is encouraged to eliminate Bob. Curiously, if Alex decided to draw with Bob, then the rating of Bob becomes completely irrelevant for Alex's gain in rating. Bob may have a vRating of 0 or 3000: in both cases, Alex will gain the same rating from the game.
But if this entire scenario played out on webDiplomacy, things would be reversed. If Alex already knows he will end the game in a 3-way draw, it is irrelevant for his gain in rating whether Bob also draws or gets eliminated. What is not irrelevant, though, is Bob's rating, under any circumstance. The mere fact that a higher rated player like Bob is in the same game as Alex means that Alex will get more points from his 3-way draw, even if Bob is included in the draw.
I will argue why this scenario makes sense under GhostRating on webDiplomacy and not under vRating on vDiplomacy. I think we can all agree that the presence of a high rated player in your game makes it more difficult to obtain a good result. Therefore, this presence should mean that you gain more points from getting a good result. GhostRating indeed gives you more points if you are in a game with a high rated player, but vRating does not; it only does so if you achieved a different result than this high rated player. On this front, GhostRating is therefore better. On the other hand, if you draw, GhostRating does not care whether you draw with the high rated player or whether the high rated player is one of the eliminated players, but vRating does care. Some of you may think that this makes vRating better. After all, eliminating a high rated player is more difficult than drawing with one, so shouldn't you be rewarded for it? No, I strongly disagree. Eliminating players in alphabetical order is also more difficult than not doing so. Why then don't we reward players for eliminating other players in alphabetical order? Because something that does not affect your score in a given game should NOT affect your rating as a player! The same goes for eliminating (or helping) specific players because of their rating.
I will now give an intuitive explanation of why vRating and GhostRating function so differently. vRating models winning any game as winning a 2-player game against each of the other players on the board, and drawing as winning a 2-player game against each of the eliminated players, with each 2-player game having a smaller weight than it would have if you won. In each fictitious 2-player game, vRating looks at how unexpected the result is (a low rated player is unlikely to beat a high rated player) and awards more points to the winner the more unexpected the win was. GhostRating functions radically differently. It looks at the game as a whole. Based on the ratings of the players in the game, it calculates the expected scores of all the players. For example, the highest rated player in a game of Classic can be expected to, on average, win more than 1 seventh of the pot. At the end of the game, it compares the actual scores to the expected scores, and awards rating proportional to the difference. For example, if your score is higher than would be expected from your rating, you get points, and if your score is a lot higher, you get a lot of points. I think looking at the game as a whole, like GhostRating does, is the correct way to do it.
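The expected-versus-actual idea can be sketched as follows. I assume an Elo-style formula for the expected shares here; the real GhostRating formula differs in detail, so treat this purely as an illustration.

```python
def expected_shares(ratings, scale=400.0):
    """Expected share of the pot for each player, derived from the
    ratings with an Elo-style weighting (an assumed functional form)."""
    weights = [10 ** (r / scale) for r in ratings]
    total = sum(weights)
    return [w / total for w in weights]

def rating_changes(ratings, actual_shares, k=100.0):
    """Each player's change is proportional to (actual - expected);
    the changes always sum to zero."""
    expected = expected_shares(ratings)
    return [k * (a - e) for a, e in zip(actual_shares, expected)]
```

Note what happens in a 7-way draw of Classic: the highest rated player's expected share exceeds 1/7, so an equal split of the pot costs him rating while everyone else gains.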
However, GhostRating has one curious feature. Remember when I said that in a game of Classic, under GhostRating a high rated player is expected to win more than 1 seventh of the pot? This means that if a game of Classic ends in a 7-way draw, the higher rated players in the game lose GhostRating and the lower rated players gain GhostRating. Under vRating, though, no one gains or loses any rating, as no one achieved a better result than anyone else. More generally, one can never lose vRating in a draw; but under GhostRating, a high rated player who is part of a large draw will lose rating. To tell you all a little anecdote: if I remember correctly, a few years ago the #1 rated player on webDiplomacy (VillageIdiot) temporarily lost his #1 spot because he was forced to accept a 4-way draw in a game (of Classic), and that cost him a lot of rating.
I do not mind this feature of GhostRating. To the contrary: I like it! It forces high rated players to actually try and do their very best to go for the solo if they want to keep their high rating. However, the new rating system I propose does not have this feature. More specifically, under the rating system I propose it is not possible to lose rating if no one in the game got a better result than you. Why didn't I just adopt the way GhostRating works? There are two reasons.
1. Any new rating system we introduce can and probably will be applied retroactively. If we introduce a rating system that is too different from the one we already had, playstyles that worked in the past may no longer work, and that would be unfair to players. Such a drastic change could also divide the community. If I suggest a rating system that is basically the same as vRating except that it fixes some obvious errors, then probably almost everybody will be in favor of the changes and will support the new rating system being introduced.
2. A rating system in which points are never lost in a draw probably translates better across multiple variants. In some variants, a larger number of players tends to get eliminated than in others. If we just had GhostRating on vDip, then high rated players would have an incentive to only play on maps that see many eliminations.
Most of my thinking went into finding a solution to the headhunting problem that does still keep key features of vRating. How does my new rating system achieve this? It is complicated and mathematically quite involved. Here is an intuitive explanation.
The rating system looks at the ratings of the players in a match and the way the match finished, for example '3-way draw'. Then it tries to guess which player got which result, based on their ratings. For example, if there is a 3-way draw, then there is a big chance the highest rated player in the game was part of it. Then it compares the actual scores of the players to the scores that were guessed. The rating a player gains or loses is proportional to the difference between his actual score and the score that was expected/guessed. The key difference from GhostRating is that GhostRating does not take into account how the match ended before guessing the players' scores, but my proposed rating system does.
I have to admit, though, that the above paragraph is a lie. This is almost how my proposed rating system works. In reality, I approximate something like this with a little trick. I can only explain this by going deeper into the mathematics. So for those interested, see the explanation in the next paragraph.
The scores of the game get sorted from highest to lowest. For instance, in case of a 3-way draw, the first three scores are 1/3 of the pot, and the rest are 0. We model the game as all of the players drawing a random number from a probability distribution centered around their vRating. There exists a number (let's call it x) such that the expected number of players to draw a number higher than x is equal to 1. We make a model in which any player who draws a number higher than x gets the highest score (in case of a 3-way draw, this is 1/3 of the pot). There exists another number y such that the expected number of players to draw a number higher than y is equal to 2. In our model, any player who draws a number higher than y (but lower than x) gets the second highest score (which is also equal to 1/3 of the pot in case of a 3-way draw). Etcetera. In the end, we compare the actual scores to the expected scores that follow from this model, and the change in vRating will be proportional to the difference.
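For those who want to see this construction concretely, here is a Python sketch. I assume normal performance draws with an illustrative spread SIGMA and find each threshold by bisection; none of these constants are necessarily the ones a real implementation would use.

```python
import math

SIGMA = 200.0  # assumed spread of each player's performance draw

def surv(x, rating):
    """P(a player with this rating draws a number above x)."""
    return 0.5 * math.erfc((x - rating) / (SIGMA * math.sqrt(2)))

def threshold(ratings, k):
    """The number x such that the expected count of players drawing
    above x equals k, found by bisection."""
    lo, hi = min(ratings) - 10 * SIGMA, max(ratings) + 10 * SIGMA
    for _ in range(100):
        mid = (lo + hi) / 2
        if sum(surv(mid, r) for r in ratings) > k:
            lo = mid  # too many players expected above mid: raise it
        else:
            hi = mid
    return (lo + hi) / 2

def expected_scores(ratings, sorted_scores):
    """sorted_scores runs from highest to lowest, e.g. [1/3, 1/3, 1/3,
    0, 0, 0, 0] for a 3-way draw in a 7-player game. A player's expected
    score is the sum over bands of P(landing in that band) * band score."""
    n = len(ratings)
    cuts = [threshold(ratings, k) for k in range(1, n)]  # x, y, ...
    result = []
    for r in ratings:
        above = [surv(c, r) for c in cuts]
        probs = [above[0]]                                   # top band
        probs += [above[j] - above[j - 1] for j in range(1, n - 1)]
        probs.append(1 - above[-1])                          # bottom band
        result.append(sum(p * s for p, s in zip(probs, sorted_scores)))
    return result
```

The rating change would then be proportional to each player's actual score minus this expected score; note that the expected scores always sum to the whole pot, because each band is constructed to contain exactly one player in expectation.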
PROBLEM 3: TAKEOVERS OF COUNTRIES IN CIVIL DISORDER
In order to talk about this subject well, we first need to define the concept of the 'worth' of a position. The worth of a position is equal to the number of centers from that position divided by the average number of centers of all the countries (including defeated ones) on the board. For example, if you have as many centers as everybody else, then your worth is 1. If you have twice as many centers as average, then your worth is 2, etc.
Previously on this site, it was not free to take over a position in Civil Disorder. You had to pay some dPoints, equal to 0.5 x the bet size of the variant x the worth of the position. The reason we multiplied by 0.5 was that we wanted there to be an incentive to take over positions. In practice, this meant that e.g. when you took over a position in a game that was still in its first year, you only had to pay half as much as the players who had been in the game from the start. Please remember this number 0.5 and what it used to be used for.
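As a quick sketch of these two definitions together (function and variable names are my own):

```python
def position_worth(centers, avg_centers):
    """Worth = this position's center count divided by the average
    center count of all countries (including defeated ones)."""
    return centers / avg_centers

def old_takeover_cost(bet, centers, avg_centers):
    """The dPoints the site used to charge for a CD takeover:
    0.5 x bet x worth."""
    return 0.5 * bet * position_worth(centers, avg_centers)
```

So taking over an average-strength position in a 100-point game cost 50 points, and only a position twice as strong as average cost the full 100.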
Let us now focus on the way taking over a country in CD affects your vRating. My information on this comes from tobi1, who wrote about this before in this thread: https://vdiplomacy.net/forum.php?viewthread=84011#84011
Tobi1 ends his post by stating that he does not know whether this works as intended. I will go out on a limb and claim that it does not work as intended. I will give one simple example.
Consider yet another Alex and Bob. Both have the same vRating and both take over a country in CD in the same game at the same time. Alex takes over a country with a worth of 1 and Bob takes over a country with a worth of 2, so Bob starts out with twice as many centers as Alex. Both play well and end up in the draw. Both gain vRating, but one of them gains more than the other. Can you guess which one of them gains more? Bob, of course. If this example made sense, I would not have presented it.
I admit that one can also come up with examples where the current system does work with regards to takeovers. However, the changes that need to be made in order to eliminate headhunting are of such a nature that we need to change the way takeovers of powers in CD are treated anyway. So let me explain how my new system treats them. My explanation is best understood if you have read everything so far and understood the parts that were more mathematically involved.
When guessing the results of the players, the system treats the rating of a player who took over a country in CD differently from their actual rating. More precisely, if w is the worth of their country at the moment they took over, then during the guess a fictitious amount of A x log(Bw) gets added to their vRating, where the log has a base of 1/B and A and B are constants. I wrote this formula in such a way that the constants A and B are easy to understand intuitively. They are what matters, since they fine-tune how takeovers are handled. I will now explain what the numbers A and B mean intuitively.
The number B is comparable to the number 0.5 that the site used before. Previously, if you took over a position that was twice as strong as other positions, you would bet the same amount as players who had been in the game from the start. Similarly, if we choose B = 0.5, then if you take over a position that is twice as strong as other positions, your vRating will be treated the same as everyone else's. If you take over a weaker position than this, you will win more if you end up winning and lose less if you end up losing. If you take over a stronger position than this, you will win less if you end up winning and lose more if you end up losing. I suggest indeed choosing B = 0.5.
The number A represents how much lower we want the computer to guess your vRating to be if you take over a position that is as strong as all others, let's say at the start of the game. If A = 500 and you take over a position that is as strong as everybody else's, then after the game is over, you will win and/or lose as much as a player with a vRating 500 points below yours. I suggest choosing A = 500. This may sound like a lot, but we have to remember that a position with a worth of 1 is not always as strong as everybody else's. Take for example the position of a player who has an average number of centers, but who just got stabbed and left the game as a result. We don't want to punish players who take over such positions.
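Putting the formula and the two suggested constants together (a sketch; the function name is my own):

```python
import math

def takeover_adjustment(worth, A=500.0, B=0.5):
    """The fictitious amount added to a CD-takeover player's rating when
    the system guesses the results: A * log base (1/B) of (B * worth).
    With B = 0.5: a position of worth 2 gives no adjustment, a position
    of worth 1 gives -A, and weaker positions give even less."""
    return A * math.log(B * worth, 1 / B)
```

With A = 500 and B = 0.5, taking over a position of worth 1 makes the system treat you as 500 points weaker than you are, while taking over a position of worth 2 leaves your rating untouched.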
LAST POINTS
In case anyone wonders: yes, this rating system is compatible with literally all scoring systems, including PPSC. I feel dirty for having created a rating system that is compatible with an abomination like PPSC, and I do not mean dirty in the positive sense of the word when you are playing Diplomacy. Without this compatibility, though, I feared it would not be implemented.
I have been a bit vague about the mathematics behind the rating system. The following is a link to a text where I dive into the mathematics in more detail.
https://drive.google.com/open?id=1emavObpJ9GB3zjxZLlv0894_eOGmVau_
I hope people like my ideas and I'd be happy to answer questions.
I hope this rating system can be implemented some day.