
Monday, January 24, 2011

USCL Game of the Year Judging Analysis

I performed some statistical analysis of the judging in the United States Chess League's 2010 Game of the Year contest.

There were five judges: Hess, Gustafsson, Johannesson, Melekhina, Young. I will refer to them by the first letter of their last name.

Several analyses were completed.

Which games did the judges agree on the most, and disagree on the most?

This can be measured by the standard deviation of the five scores given to each game: the higher the standard deviation, the more the judges disagreed.
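As a minimal sketch of that calculation in Python (the dictionary below uses made-up placeholder scores, not the actual 2010 ballots):

import statistics

# Hypothetical layout: each game maps to the list of five judges' scores.
# The numbers here are placeholders, not the real judging data.
scores = {
    "Game A": [20, 18, 19, 17, 20],
    "Game B": [2, 15, 9, 19, 4],
}

# Sample standard deviation per game: higher means more disagreement.
# (statistics.pstdev would give the population version instead.)
for game, judge_scores in scores.items():
    print(game, round(statistics.stdev(judge_scores), 2))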

The most agreed upon games were:
  1. #20, Sammour-Hasbun vs. Kaplan (sd = 2.51)
  2. #2, Sammour-Hasbun vs. Kacheishvili (sd = 2.61)
  3. #4, Rosen vs. Guo (sd = 2.97)

The most disagreed upon games were:
  1. #13, Schroer vs. Kacheishvili (sd = 7.99)
  2. #19, Galofre vs. Milat (sd = 7.80)
  3. #12, Friedel vs. Akobian (sd = 7.36)
Which judges were most different?

I calculated which judges were "most different" from the combined wisdom of all the judges taken together. The judges that were the most different could be considered outliers.

There are several ways to do this. I will demonstrate two approaches.

FINDING THE OUTLIER JUDGES

First, I compared the score a judge gave to the average score of all the judges, tempered by the amount of disagreement among the judges. For instance, Judge Y gave 2 points (19th place) to Schroer vs. Kacheishvili, while the average number of points was 9.2 and the standard deviation (the amount of disagreement) was 7.99. For that game, Judge Y would therefore receive the absolute value of (2 - 9.2)/7.99, or about 0.90 "difference points". Add up the difference points for each of the twenty games: the more difference points a judge accumulated, the more different that judge was from the other judges.
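In code, that calculation might look like the sketch below; the dict-of-dicts layout and the judge keys are assumptions on my part, and only the single-game check at the end uses numbers quoted above.

import statistics

def difference_points(scores_by_game, judge):
    """Sum over all games of |judge's score - game average| / game standard deviation."""
    total = 0.0
    for by_judge in scores_by_game.values():
        vals = list(by_judge.values())
        mean = statistics.mean(vals)
        sd = statistics.stdev(vals)  # assuming the quoted sd values are sample standard deviations
        total += abs(by_judge[judge] - mean) / sd
    return total

# Single-game check against the worked example above: Judge Y's 2 points,
# a game average of 9.2, and an sd of 7.99 give roughly 0.90 difference points.
print(round(abs(2 - 9.2) / 7.99, 2))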

The total difference points for each judge were...
Judge Y: 17.49
Judge J: 11.09
Judge M: 19.40
Judge G: 12.80
Judge H: 16.96

Therefore, Judge Y and Judge M were the most different from the other judges.

Then, we could discard the scores of these two judges, and rescore the contest.
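A sketch of that rescoring, assuming the same dict-of-dicts layout as above and that "Y" and "M" are the keys for the two discarded judges:

def rescore_without(scores_by_game, dropped=("Y", "M")):
    """Re-total each game using only the judges who were not discarded."""
    return {
        game: sum(score for judge, score in by_judge.items() if judge not in dropped)
        for game, by_judge in scores_by_game.items()
    }

# Example usage: new_totals = rescore_without(scores_by_game)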

See below for how the results would have changed.


COMPUTE THE MIDDLE SCORES FOR EACH GAME

Another way of rescoring the contest is to do it on a "per game" basis, as opposed to throwing out judges as a whole. Instead, discard the high and low scores given to each game, and create a new total from the remaining scores.

For example, Galofre vs. Milat received scores of 1, 1, 1, 5, and 19. If we were to use this method, we would throw out one of the 1s and the 19, and the game would receive a revised score of 7.
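In code, the "No Hi-Lo" total for a game is simply the sum after removing a single lowest and a single highest score; a minimal sketch:

def no_hi_lo_total(judge_scores):
    """Drop exactly one minimum and one maximum score, then sum the rest."""
    return sum(sorted(judge_scores)[1:-1])

# Galofre vs. Milat example from above: drops one 1 and the 19, leaving 1 + 1 + 5 = 7.
print(no_hi_lo_total([1, 1, 1, 5, 19]))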

. . .

The table below shows the original place for each game, as well as the place it would have come in if you used the "Three Judges Only" method or the "No Hi-Lo" method. Ties were not broken for these alternate methods.

GAME / Original / Three Judges Only / No Hi-Lo
Sammour-Hasbun vs. Kaplan / 20 / 19 / 19
Galofre vs. Milat / 19 / 20 / 20
Gurevich vs. Barcenilla / 18 / 18 / 18
Akobian vs. Friedel / 17 / T13-14 / 17
Rosenthal vs. Thompson / 16 / T15-17 / 15
Krasik vs. Balasubramanian / 15 / T13-14 / 16
Hungaski vs. Schroer / 14 / T15-17 / 13
Schroer vs. Kacheishvili / 13 / T15-17 / 14
Friedel vs. Akobian / 12 / T11-12 / 12
Shulman vs. Felecan / 11 / T11-12 / 11
Rensch vs. Abrahamyan / 10 / T4-5 / T7-10
Shankland vs. Becerra / 9 / 8 / T7-10
Stripunsky vs. Erenburg / 8 / 10 / T7-10
Christiansen vs. Kraai / 7 / T6-7 / T7-10
Schroer vs. Christiansen / 6 / T4-5 / 4
Kacheishvili vs. Shankland / 5 / 9 / T5-6
Rosen vs. Guo / 4 / T6-7 / T5-6
Shulman vs. Khachiyan / 3 / 2 / 2
Sammour-Hasbun vs. Kacheishvili / 2 / 3 / 3
Akobian vs. Shulman / 1 / 1 / 1


Readers are invited to draw their own conclusions.

Friday, September 11, 2009

Game of the Week Controversy

Only week 2 in the United States Chess League, and there's already controversy about the "Game-of-the-Week" judging. It appears that several members of the Boston Blitz are upset that their team member, Marc Esserman, only came in second place in the voting.

There have been claims of a "whacko judge". But how can we define "whacko"? Quantitatively, I mean...

One method is to compute standard Pearson correlations between an individual judge's scores and the total scores of all the judges. Yes, there are some collinearity issues, since each judge's own scores are part of that total, but I'm not really going to worry about that.

Correlations produce a score called "r", which ranges from -1 to 1. When r=1, one set of data lines up exactly with another. For example, the ordered data sets {1, 2, 3} and {2, 4, 6} have r=1. On the flip side, r=-1 means that one data set lines up exactly opposite to the other. Take {1, 2, 3} again: the data set {6, 4, 2} has r=-1, because the pairs 1 and 6, 2 and 4, and 3 and 2 line up in the opposite order.

So what does r=0 mean? It means that the two data sets have no statistical correlation at all, negative or positive; it is as if they were two random data sets.
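Those two toy examples are easy to check in Python (assuming numpy is available):

import numpy as np

# {1, 2, 3} vs {2, 4, 6}: perfectly aligned, so r = 1
print(np.corrcoef([1, 2, 3], [2, 4, 6])[0, 1])

# {1, 2, 3} vs {6, 4, 2}: perfectly reversed, so r = -1
print(np.corrcoef([1, 2, 3], [6, 4, 2])[0, 1])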

In terms of USCL GOTW judging, what we would like to see is some sort of positive correlation between each judge's scores and the scores given to the games as a whole. If there is a "whacko" judge, then they should have a near-zero or negative correlation. Indeed, even if one judge has a noticeably lower positive correlation than the other judges, it is something to keep track of.

For the Week 2 GOTW judging, I calculated the correlations between each judge's scores and the tally of all the judges. Before I go into the numbers, let me first give an example.

Winner / Total Points / Score of "Greg"
Friedel / 18 / 2
Esserman / 14 / 5
Ehlvest / 9 / 0
Perelshteyn / 9 / 3
Charbonneau / 6 / 0
Krasik / 6 / 4
Lopez / 4 / 0
Zaremba / 3 / 0
Becerra / 3 / 0
Matlin / 1 / 0
Recio / 1 / 1
Altounian-Burnett / 1 / 0

The Pearson correlation ("r") between Greg's scores and the total scores of all the judges was 0.594, a fairly high positive value, which indicates that his choices matched reasonably well with the choices of the group as a whole.
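Using the numbers from the table above, that figure can be reproduced in a few lines of Python (again assuming numpy):

import numpy as np

total = [18, 14, 9, 9, 6, 6, 4, 3, 3, 1, 1, 1]  # total points column from the table
greg = [2, 5, 0, 3, 0, 4, 0, 0, 0, 0, 1, 0]     # Greg's scores from the table

# Prints approximately 0.594
print(round(np.corrcoef(total, greg)[0, 1], 3))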

Let's look at the correlations of all the judges of week 2, from highest (i.e., agreed closest with the group) to lowest.

Week 2 Correlations
Arun: 0.770
Jim: 0.733
Jeff: 0.604
Greg: 0.594
Michael: 0.280

Michael's correlation is well below the other judges', but still positive, which means that he did agree somewhat with the other judges.

So, let's go back and see what happened for the Week 1 voting, where the results were apparently less controversial.

Week 1 Correlations
Jeff: 0.882
Michael: 0.799
Arun: 0.683
Jim: 0.683
Greg: 0.667

These are all reasonably close, with Jeff and Michael having slightly better correlations.

So to summarize, we can compute each judge's average correlation across the two weeks.

Average Correlation for both weeks
Jeff: 0.743
Arun: 0.727
Jim: 0.708
Greg: 0.631
Michael: 0.540

All are reasonably high, so I think it may be too early to refer to a "whacko" judge.