There have been claims of a "whacko judge". But how can we define "whacko"? Quantitatively, I mean...

One method is to compute a standard Pearson correlation between an individual judge's scores and the total scores of all the judges. Yes, there are some collinearity issues (each judge's own scores are part of the total), but I'm not really going to worry about that.

Correlations produce a score called "r", which ranges from -1 to 1. When r=1, it means that one set of data lines up exactly with another set of data. For example, the ordered data sets {1, 2, 3} and {2, 4, 6} have r=1. On the flip side, r=-1 means that one data set is exactly opposite (in terms of lining up) to another. Take the data set {1, 2, 3} again; paired with the data set {6, 4, 2} it has r=-1, because 1 and 6, 2 and 4, 3 and 2 line up in the opposite order.

So what does r=0 mean? It means that the two data sets have no linear correlation at all, either negative or positive; it is as if they were two unrelated random data sets.
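The two small examples above can be checked with a few lines of Python. This is only a minimal sketch of the textbook Pearson formula; the helper name `pearson_r` is made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (the covariance term).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each data set.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r_same = pearson_r([1, 2, 3], [2, 4, 6])      # lines up exactly
r_opposite = pearson_r([1, 2, 3], [6, 4, 2])  # lines up in reverse
print(r_same, r_opposite)  # prints 1.0 -1.0
```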

In terms of USCL GOTW judging, what we would like to see is some sort of positive correlation between each judge's scores and the scores given to the games as a whole. If there is some sort of "whacko" judge, then they should have a near-zero or a negative correlation. Indeed, even if one judge has a noticeably lower positive correlation than the other judges, it is something worth keeping track of.

For the Week 2 GOTW judging, I calculated the correlation between each judge's scores and the tally of all the judges. Before I go into the numbers, let me first give an example.

| Winner | Total Points | Score of "Greg" |
| --- | --- | --- |
| Friedel | 18 | 2 |
| Esserman | 14 | 5 |
| Ehlvest | 9 | 0 |
| Perelshteyn | 9 | 3 |
| Charbonneau | 6 | 0 |
| Krasik | 6 | 4 |
| Lopez | 4 | 0 |
| Zaremba | 3 | 0 |
| Becerra | 3 | 0 |
| Matlin | 1 | 0 |
| Recio | 1 | 1 |
| Altounian-Burnett | 1 | 0 |

The Pearson correlation ("r") between Greg's scores and the total scores of all the judges was 0.594, a fairly high positive number, which indicates that his choices matched reasonably well with the choices of the group as a whole.
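As a check, here is a minimal Python sketch that recomputes Greg's r from the table above. The choice of Python and the helper name `pearson_r` are assumptions for illustration; the post does not say what tool was actually used. The second calculation is an extra, not part of the original analysis: it correlates Greg's scores with the total of the other four judges (the total minus Greg's own points), which is one simple way to sidestep the collinearity mentioned earlier.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Week 2 data in table order: Friedel, Esserman, Ehlvest, Perelshteyn,
# Charbonneau, Krasik, Lopez, Zaremba, Becerra, Matlin, Recio,
# Altounian-Burnett.
total_points = [18, 14, 9, 9, 6, 6, 4, 3, 3, 1, 1, 1]
greg_scores = [2, 5, 0, 3, 0, 4, 0, 0, 0, 0, 1, 0]

# Correlation with the full tally, as described in the post.
r_full = pearson_r(greg_scores, total_points)
print(round(r_full, 3))  # prints 0.594

# Extra check (not in the original post): remove Greg's own points from
# the tally before correlating.
others_total = [t - g for t, g in zip(total_points, greg_scores)]
r_without_self = pearson_r(greg_scores, others_total)
print(round(r_without_self, 3))  # prints 0.307
```

The gap between the two numbers gives a rough sense of how much of a judge's correlation comes from their own points being counted in the total.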

Let's look at the correlations of all the judges of week 2, from highest (i.e., agreed closest with the group) to lowest.

Week 2 Correlations

Arun: 0.770

Jim: 0.733

Jeff: 0.604

Greg: 0.594

Michael: 0.280

Michael's score is well below those of the other judges, but still positive, which means that he did agree somewhat with the group as a whole.

So, let's go back and see what happened for the Week 1 voting, where the results were apparently less controversial.

Week 1 Correlations

Jeff: 0.882

Michael: 0.799

Arun: 0.683

Jim: 0.683

Greg: 0.667

These are all reasonably close, with Jeff and Michael having the best correlations.

So to summarize, we can determine the average correlation for both weeks.

Average Correlation for both weeks

Jeff: 0.743

Arun: 0.727

Jim: 0.708

Greg: 0.631

Michael: 0.540

All are reasonably high, so I think it may be too early to refer to a "whacko" judge.
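The averages above can be reproduced from the two weekly lists. This is a small sketch with made-up variable names; because it averages the already-rounded weekly values, the last digit can differ by 0.001 from the figures listed above.

```python
# Weekly correlations as listed in the post.
week1 = {"Jeff": 0.882, "Michael": 0.799, "Arun": 0.683,
         "Jim": 0.683, "Greg": 0.667}
week2 = {"Arun": 0.770, "Jim": 0.733, "Jeff": 0.604,
         "Greg": 0.594, "Michael": 0.280}

# Simple mean of the two weeks for each judge.
averages = {judge: (week1[judge] + week2[judge]) / 2 for judge in week1}

# Rank judges from closest agreement with the group to least.
ranking = sorted(averages, key=averages.get, reverse=True)
for judge in ranking:
    print(f"{judge}: {averages[judge]:.4f}")
```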

## 4 comments:

BL, you are a whacko statistician!

-Matt

What’s the correlation between Boston not winning something and lots of loud complaining?

1:1

We are the loudest, most obnoxious band of badasses in the USCL. Go ahead, get all riled up and try to beat us! Press for a win in that drawn ending; try to punish us for that opening inaccuracy; we'll take the points, thank you very much.

:)

(BTW, I was joking in my previous comment.)

-Matt

There's a follow-up to this article that I've posted. It lends a tad more credence to the whacko theory, and introduces the "whackometer".

http://bioniclime.blogspot.com/2009/09/gotw-whackometer-part-2.html
