Friday, September 11, 2009

Game of the Week Controversy

Only week 2 in the United States Chess League, and there's already controversy about the "Game-of-the-Week" judging. It appears that several members of the Boston Blitz are upset that their team member, Marc Esserman, only came in second place in the voting.

There have been claims of a "whacko judge". But how can we define "whacko"? Quantitatively, I mean...

One method is to compute a standard Pearson correlation between an individual judge's scores and the total scores of all the judges. Yes, there are some collinearity issues (each judge's own scores are part of the total), but I'm not really going to worry about that.

Correlations produce a score called "r", which ranges from -1 to 1. When r=1, it means that one set of data lines up exactly with another set of data. For example, the ordered data sets {1, 2, 3} and {2, 4, 6} have an r=1. On the flip side, r=-1 means that one data set lines up in exactly the opposite order from the other. Take the data set {1, 2, 3} again; paired with the data set {6, 4, 2} it has an r=-1, because 1 and 6, 2 and 4, 3 and 2 line up in the opposite order.

So what does r=0 mean? It means that the two data sets have no linear correlation at all, either negative or positive; it is as if they were two unrelated random data sets.
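For anyone who wants to check the r=1 and r=-1 examples above, here is a minimal Python sketch of the textbook Pearson formula. The function name pearson_r is just my own label for illustration, not something from a particular library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (the covariance term)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each data set
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_same = pearson_r([1, 2, 3], [2, 4, 6])      # data sets line up exactly
r_opposite = pearson_r([1, 2, 3], [6, 4, 2])  # data sets line up in opposite order

print(r_same)      # 1.0
print(r_opposite)  # -1.0
```

Note that the formula divides by zero if one of the lists is constant (for example, a judge who gives every game the same score), so a real script would need to guard against that case.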

In terms of USCL GOTW judging, what we would like to see is some sort of positive correlation between each judge's scores and the scores given to the games as a whole. If there is some sort of "whacko" judge, then they should have a near-zero or a negative correlation. Indeed, even if one judge has a noticeably lower positive correlation than the other judges, it is something worth keeping track of.

For the Week 2 GOTW judging, I calculated the correlation between each judge's scores and the tally of all the judges. Before I go into the numbers, let me first give an example.

Winner / Total Points / Score of "Greg"
Friedel / 18 / 2
Esserman / 14 / 5
Ehlvest / 9 / 0
Perelshteyn / 9 / 3
Charbonneau / 6 / 0
Krasik / 6 / 4
Lopez / 4 / 0
Zaremba / 3 / 0
Becerra / 3 / 0
Matlin / 1 / 0
Recio / 1 / 1
Altounian-Burnett / 1 / 0

The Pearson correlation ("r") between Greg's scores and the total scores of all the judges was 0.594. That is a fairly high positive number, which indicates that his choices matched reasonably well with the choices of the group as a whole.
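Greg's number can be reproduced from the table above. The following is only a sketch: the variable names (total_points, greg_scores) and the helper pearson_r are my own, and the two lists are simply the second and third columns of the table, in the same order.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Columns from the Week 2 table, Friedel first, Altounian-Burnett last
total_points = [18, 14, 9, 9, 6, 6, 4, 3, 3, 1, 1, 1]
greg_scores = [2, 5, 0, 3, 0, 4, 0, 0, 0, 0, 1, 0]

greg_r = pearson_r(total_points, greg_scores)
print(round(greg_r, 3))  # 0.594
```

The same call, with another judge's column swapped in for greg_scores, would give that judge's correlation with the group.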

Let's look at the correlations of all the judges of week 2, from highest (i.e., agreed closest with the group) to lowest.

Week 2 Correlations
Arun: 0.770
Jim: 0.733
Jeff: 0.604
Greg: 0.594
Michael: 0.280

Michael's correlation is well below those of the other judges, but it is still positive, which means that he did agree somewhat with the other judges.

So, let's go back and see what happened for the Week 1 voting, where the results were apparently less controversial.

Week 1 Correlations
Jeff: 0.882
Michael: 0.799
Arun: 0.683
Jim: 0.683
Greg: 0.667

These are all reasonably close, with Jeff and Michael having somewhat higher correlations than the rest.

So to summarize, we can determine the average correlation for both weeks.

Average Correlation for both weeks
Jeff: 0.743
Arun: 0.727
Jim: 0.708
Greg: 0.631
Michael: 0.540

All are reasonably high, so I think it may be too early to refer to a "whacko" judge.
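The averages above are just the mean of each judge's two weekly correlations. Here is a small sketch of that arithmetic, using the rounded weekly figures listed earlier. The dictionary names are my own, and because the inputs are already rounded to three places, the last digit of a result can differ by one from the averages listed above.

```python
# Weekly correlations as listed above (already rounded to three places)
week1 = {"Jeff": 0.882, "Michael": 0.799, "Arun": 0.683, "Jim": 0.683, "Greg": 0.667}
week2 = {"Arun": 0.770, "Jim": 0.733, "Jeff": 0.604, "Greg": 0.594, "Michael": 0.280}

# Mean of the two weeks for each judge
averages = {judge: (week1[judge] + week2[judge]) / 2 for judge in week1}

# Print from highest (agreed most with the group) to lowest
for judge, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{judge}: {avg:.3f}")

# The judge who agreed least with the group over both weeks
lowest_judge = min(averages, key=averages.get)
```

With more weeks of data, the same approach extends by summing over a list of weekly dictionaries and dividing by the number of weeks.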


Anonymous said...

BL, you are a whacko statistician!


Elizabeth Vicary said...

What’s the correlation between Boston not winning something and lots of loud complaining?

Anonymous said...


We are the loudest, most obnoxious band of badasses in the USCL. Go ahead, get all riled up and try to beat us! Press for a win in that drawn ending; try to punish us for that opening inaccuracy; we'll take the points, thank you very much.


(BTW, I was joking in my previous comment.)


Bionic Lime said...

There's a follow up to this article that I've posted. Lends a tad more credence to the whacko theory, and introduces the "whackometer".