#+title: Zuckerberg's Facemash rankings: here comes the science
#+author: Leo Alekseyev
#+startup: noindent
#+property: session *R-babel*
#+property: tangle yes
* Introduction
If you've seen the movie /Social Network/, you know that back when Mark
Zuckerberg was just another Harvard undergrad with too much free time and a
knack for writing PHP, he became notorious for his personal twist on the
notorious /hotornot.com/ concept called /Facemash/. The idea was that you
compared people not by holding out virtual 10-point scale score cards (the
way hotornot.com operated), but
instead you were shown pictures of two different people, and you selected the
one you thought were better looking. Based on this, the system computed
the "hotness" ranking of a person.
In hindsight, we can speculate that aside from being a worthwhile exercise in
creating a Web 2.0 service, usage patterns of Facemash gave Mark invaluable
insight into the basic sociobiology of the internet. Indeed, there once was
a time when the social activities on the web were mainly restricted to
elevated, informed, and sophisticated discussions among like-minded
individuals (inasmuch as pornography consumption does not typically count as
a social activity). Of course, as we now understand, coherent discourse is
but one of a myriad ways in which we express our inclinations as social
creatures, and certainly not the default one. Edward Wilson, who coined the
term /sociobiology/ (incidentally, also at Harvard), popularized the famous
example that "an isolated [Rhesus monkey] will repeatedly pull a lever
with no reward other than the glimpse of another monkey". So, while
Zuckerberg started out with a drunken thesis that [[http://en.wikipedia.org/wiki/Facemash][Harvard undergrads weren't far
removed from farm animals in their looks]], he ended up proving a somewhat
related notion: that we all aren't far removed from Wilson's monkeys in our
social actions.
Zuckerberg took the lesson to heart, and the result is history. Facemash is
now a largely forgotten achievement, overshadowed by the Cal. Ave social
conglomerate. This is a shame, because Facemash has rather interesting
mathematical underpinnings. Indeed, we might ask: how does one go from a bunch of
photo comparisons to an absolute ranking? How reliable is the result? Is
there an optimal strategy for selecting which pictures to show, and how does
it depend on our objective?
These questions are general and are closely related to the problems of
ranking or seeding players in two-player contests, and, in fact, to any
setting where multiple binary comparisons are performed. In what follows, I
will arm myself with R, common sense, and a modicum of probability theory and
dig into how the ranking algorithm works. Or, in the immortal words of the
XKCD stick figure:
Stand back! I'm going to try science!
Thanks go to Patrick Collison of [[http://stripe.com][stripe.com]] for suggesting the idea!
One last thing, before we begin (if you are reading this online): this is the
project's org file. The org file is, effectively, a live lab notebook of the
analysis, with little post-processing of the final results. No effort will
be made to stick to formal scientific writing or to cherry-pick material for
the presentation or to be meticulous about LaTeX formatting.
* Background
We have a bunch of pairwise binary comparisons between N agents, and we want
to (at the minimum) estimate their ordinal rank, and, possibly, some "total
hotness" score. The first algorithm that comes to mind is the Elo rating
system for chess, since the two problems and underlying assumptions are
largely the same. This idea is not very original and, in fact, according to
Wikipedia, this is just what Zuckerberg actually used. The algorithm also
happens to be very simple, so let's implement it and see how well it works!
The basic idea of Elo is the following: the outcome of a match is a random
variable, and the probability of one person winning depends on the relative
skill difference between him and the opponent. At the end of the match, each
player's rank is changed in proportion to the difference between the expected
outcome (conditional on the rankings before the match) and the actual
outcome. If the ranking is correct, then the expected change in rank is
zero.
More information can be found on [[http://en.wikipedia.org/wiki/Elo_rating_system][Wikipedia]] and in some papers (this one, for
instance):
http://www.rcec.nl/publicaties/overige%20publicaties/cito_report.pdf
(As a side note, it seems that a large chunk of papers on Elo and alternative
ranking systems seems to come from animal researchers studying, e.g.,
"fecundity and survival rates in relation to social rank... [in] reindeer,
clown anemonefish, rabbits, mountain goats, [and] spotted hyaena".
Once again, Zuckerberg's desire to compare his fellow students to farm animals might have
been closer to the mark than he realized.)
* The plan
Let's implement Elo and see how it behaves under the following assumptions:
- each person has a "true" performance value (drawn from a distribution,
e.g. $\mathcal{N}(0,1)$)
- probability of winning in a match is $1/(1+\exp(-x S))$ where x is the
difference in the true performance values, as above. Depending on whether
your background is machine learning, physics, or plain old math, you will
recognize this curve as either the logistic function, or Fermi function, or
$1 + \tanh$. Here, S is the sensitivity parameter; for high S, small
differences in true ability will result in high (or low) probability of
winning.
- each match is either a win or loss (no draws); you get 1 point for a win, 0
for a loss
- Rankings are updated as per the formula:
$R(i) = R(i-1) + K (W - \mathbb{E}[W|R(i-1)])$
Here, W is the actual score, and $\mathbb{E}[W|R]$ is the expected score based on the
old ranking. K is a parameter that determines how fast the estimator
converges to the true rankings.
Questions we will consider now:
- how well does the algorithm converge to true ordinal rankings? To true performance values?
- how do the parameters S and K affect accuracy and convergence?
- what difference does the initial distribution make?
Other questions worth examining:
- what about the distribution of winning probabilities (e.g., what if we use
normal CDF instead of logistic CDF)?
- how well can we estimate the true performance distribution? How is
distribution different if we are studying chess performance?
Attractiveness? What about attractiveness of males vs females vs goats?
* The implementation
We will use R to implement the algorithm. Source code blocks are embedded
into this org file using org-babel. They are, effectively, wrapper functions
that can be called from org mode.
Global variables, parameters, and flags:
- =Nagents=: number of agents
- =Nmatches=: is actually the number of algorithm iterations. The number of
_pairwise_ matches is =Nagents= * =Nmatches=
- =S=, =K=: sensitivity/convergence parameters as described above
- =true.perf=: set of true performance values
- =distn=: underlying "true performance" distribution,
- =use.adaptive.k=:
* Simulate matches and generate ranking estimates
We take an array of agents (1:Nagents) and randomly assign to each a "true
performance" value drawn from some distribution using set.true.perf(). To
generate a set of random pairings, we match this array against its random
permutation. Note that this leaves a finite probability of an agent being
matched with him/herself, but this won't affect the overall conclusions. We
perform Nmatches of such permutations. At each step, all agents are randomly
paired with other agents, "fight", and the rank is updated for every agent.
Results of the simulation are maintained in variables:
ranking.evolution, ranking.evolution.ordering (Nmatches+1 x Nagents matrices)
Each row of these matrices can be compared to the true.perf and true.ordering vectors.
Each column of these matrices gives the evolution of absolute rank / ordinal
rank for a given agent.
#+NAME: elo_sim
#+HEADER: :var Nagents=200 :var Nmatches=10000 :var s=5 :var K=0.01 :var distn="normal" :var use.adaptive.k="FALSE"
#+BEGIN_SRC R :results output silent
set.seed(42)
## true.perf <- rnorm(Nagents,0,1)
## generate the "true performance" distribution
set.true.perf <- function(type = c("normal", "uniform", "u01",
"lognormal")) {
type <- match.arg(type)
switch(type,
normal = rnorm(Nagents,0,1),
uniform = runif(Nagents,-0.5,0.5),
u01 = runif(Nagents,0,1),
lognormal = {tmp <- rlnorm(Nagents,0,1); tmp - mean(tmp)}
)}
true.perf <- set.true.perf(distn)
## Probability of agent1 winning in a match where the true performance values
## are _perf1_ and _perf2_ (logistic curve)
prob.win.match <- function(perf1, perf2, s = 1)
{ 1/(1+exp((perf2-perf1)*s)) }
## Probability of winning for a set of agents with true performance values
## given in the array _ranking_ and with the order of opponents given by the
## array _opponents_. A wrapper over prob.win.match(...).
prob.win <- function(opponents, ranking = true.perf, s = 1)
{ prob.win.match(ranking, ranking[opponents], s) }
fixed.opponents <- sample.int(Nagents) #for exploration / debugging
outcomes.matrix <- array(0,c(Nagents,Nagents))
## Simulate random pairings (via sample.int(Nagents)). Simulate resultant match
## outcomes using a set of Bernoulli trials with probabilities of winning
## given by the logistic curve as per prob.win(...) invoked with true
## performance values. Update the ranking estimate.
rank.update <- function(oldrank, K, s = 1, fixed.opp = FALSE, outcome.mtx.update = FALSE) {
if (fixed.opp == FALSE) { opponents <- sample.int(Nagents) }
else { opponents <- fixed.opponents } # debugging
exp.outcome <- prob.win(opponents, ranking = true.perf, s = s)
assumed.outcome <- prob.win(opponents, ranking = oldrank, s = s)
trial.outcome <- rbinom(exp.outcome, size = 1, prob = exp.outcome)
if (outcome.mtx.update == TRUE) {
for(n in 1:Nagents) {
outcomes.matrix[n, opponents[n]] <<- outcomes.matrix[n, opponents[n]] + trial.outcome[n]
}
}
oldrank + K * (trial.outcome - assumed.outcome)
}
## Get order of vector elements (e.g. [4, 9, 5] gives [1, 3, 2])
get.ordering <- function(relative.vec)
{ match(relative.vec,sort(unique(relative.vec))) }
gen.adaptive.kvec <- function(kmax, kmin, length, ramp.length) {
k <- rep(kmin, length)
k[1:ramp.length] <- seq(kmax, kmin, length = ramp.length)
k
}
## R is column-major, so want to have longer columns
ranking.evolution <- array(0,c(Nmatches+1,Nagents))
## To simulate the introduction of a new agent into the set and observe how it
## evolves, we will zero out an existing rank for some agent midway through
## the simulation. To do so, invoke the function as
## run.trials(simulate.new.agent = c(AGENT_NUM, ITERATION_NUM))
run.trials <- function(simulate.new.agent = NULL, outcome.mtx.update = FALSE) {
if (!is.null(simulate.new.agent)) {
agent.num <- simulate.new.agent[1]
iteration.num <- simulate.new.agent[2]
}
if (use.adaptive.k == TRUE) {
kvec <- gen.adaptive.kvec(10 * K, K, Nmatches + 1, 0.15 * Nmatches)
} else {
kvec <- gen.adaptive.kvec(K, K, Nmatches + 1, 1)
}
for(n in 2:(Nmatches+1)) {
if (!is.null(simulate.new.agent) && (iteration.num == n))
{ ranking.evolution[n-1,agent.num] <<- 0 }
ranking.evolution[n,] <<- rank.update(ranking.evolution[n-1,],kvec[n],s,outcome.mtx.update=outcome.mtx.update)
}
}
run.trials()
ranking.evolution.ordering <- t(apply(ranking.evolution,1,get.ordering))
true.ordering <- get.ordering(true.perf)
#+END_SRC
* Analysis / assessment / figures of merit
To analyze the performance of the simulation, we use the known true ranking
values to compute the root-mean squared error, standard deviation of the
estimates, and error of the mean. We compute these metrics for every agent,
for every step in the simulation. Threshold _thr_ is used to ignore some
initial period of time (before the estimators stated converging to the true value).
We can then average these metrics over all agents to get some figures of
merit for the overall simulation, or examine a particular agent to see
e.g. whether outliers perform anomalously.
#+NAME: elo_sim_analysis
#+HEADER: :var idx=1 :var print="TRUE" :var plot.filename="conv1.png" :var plot.disp="FALSE"
#+BEGIN_SRC R :results output replace
## assess the quality of the ranking:
## predict is a numeric vector; throw out the first thr percent of it,
## then get MSE and standard deviation for the rest based on the known
## value true (NB: true is a scalar value for one agent)
get.error.indiv <- function(predict, true, thr = 0.15) {
start <- 1 + ceiling(thr * length(predict))
predict <- predict[start:length(predict)]
c(sqrt(mean((predict-true)^2)),sd(predict),abs(mean(predict)-true),mean(predict)-true)
}
## est is a matrix with NROW agents, NCOL trials
get.error.aggregate <- function(est, true, thr = 0.15) {
mapply(get.error.indiv, split(est,col(est)), true, MoreArgs = list(thr))
}
gen.sample.plots <- function(idx) {
par(mfcol=c(2,1),oma=c(0,0,2,0))
plot(ranking.evolution.ordering[,idx],col=1,type="l")
abline(h=true.ordering[idx],col=2)
title(paste("Ordinal ranking evolution for agent ",idx,"; true
value is ", true.ordering[idx], sep=""))
plot(ranking.evolution[,idx],col=1,type="l")
abline(h=true.perf[idx],col=2)
title(paste("Absolute ranking evolution for agent ",idx,"; true
value is ", format(true.perf[idx], digits=2),sep=""))
}
perf.check.ord <- get.error.aggregate(ranking.evolution.ordering, true.ordering, thr = 0.2)
perf.check.val <- get.error.aggregate(ranking.evolution, true.perf, thr = 0.2)
print.perf.check <- function(perf.check, ex.idx = idx) {
cat("RMSE, sdev, abs(mean - true), mean - true: avg over agents\n")
cat(rowMeans(perf.check),"\n")
cat("RMSE, sdev, abs(mean - true), mean - true: agent",ex.idx,"\n")
cat(perf.check[, ex.idx],"\n")
}
cat.fname.link <- function() { cat("[[file:",img.dir,"/",plot.filename,"]]\n",sep="") }
if (print == TRUE) {
cat("true.ordering[1:8]\n ",true.ordering[1:8],"\n")
cat("true.perf[1:8]\n ",true.perf[1:8],"\n")
cat("Performance check: ordinal rank estimate\n")
print.perf.check(perf.check.ord, idx)
cat("Performance check: underlying value estimate\n")
print.perf.check(perf.check.val, idx)
}
if (plot.disp == TRUE) { X11(); gen.sample.plots(idx) }
img.dir = "images"
if (!file.exists(img.dir)) { dir.create(img.dir) }
png(file=paste(img.dir,"/",plot.filename,sep=""))
gen.sample.plots(idx)
invisible(dev.off())
cat("[[file:",img.dir,"/",plot.filename,"]]\n",sep="")
#+END_SRC
#+RESULTS: elo_sim_analysis
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.380491 2.201938 0.7447513 5.817569e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.618611 2.184542 1.444125 1.444125
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1281709 0.06441655 0.1059347 0.02475723
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.1899602 0.1097201 0.1550738 -0.1550738
[[file:images/conv1.png]]
#+end_example
* Time for experimentation!
** Rate of convergence, outliers, and the need for adaptive K
## org-babel tip: to run the source block, C-c C-c w/ point on the #+call line
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="FALSE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="FALSE")
Let's see what we got:
#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv1.png",plot.disp="FALSE")
#+RESULTS: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.380491 2.201938 0.7447513 5.817569e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.618611 2.184542 1.444125 1.444125
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1281709 0.06441655 0.1059347 0.02475723
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.1899602 0.1097201 0.1550738 -0.1550738
[[file:images/conv1.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
#+RESULTS:
[[file:images/conv1.png]]
[[file:images/conv1.png]]
What does the diagnostics tell us: on average, the rankings are correct +/- 2
steps in a field of 200. The RMSE for agent 1 (esp. for the value estimate)
is much higher than the standard deviation, suggesting it has problems
converging to the correct number. If we look at the plot, we see that this
is in fact the case. Notice now that Agent 1 is in the 95th percentile by
rank, and that its true performance is 1.4*sigma
## To see percentiles, C-c C-c the following line and look at output in the
## session buffer (*R-babel*) or *Messages* buffer
Now let's look at agent 8, which is closer to the 45th percentile. Does it
converge faster? Yes:
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="conv2.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="conv2.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.380491 2.201938 0.7447513 5.817569e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
3.302878 3.227995 0.70025 -0.70025
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1281709 0.06441655 0.1059347 0.02475723
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.03239877 0.02879233 0.01485887 0.01485887
[[file:images/conv2.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
#+RESULTS:
[[file:images/conv2.png]]
Notice how RMSE and sdev metrics for Agent 8 are of the same magnitude and
the ranking value appears to oscillate around its mean fairly early on.
Hypothesis: increasing K will help the "outlier" rank estimate converge quicker:
Rerun the simulation:
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.1, distn="normal", use.adaptive.k="FALSE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.1, distn="normal", use.adaptive.k="FALSE")
And look at the new plot:
#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv3.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv3.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
6.171757 5.975447 1.232069 0
RMSE, sdev, abs(mean - true), mean - true: agent 1
4.672165 4.625406 0.661375 0.661375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1586351 0.1252956 0.0856625 0.05684161
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.2457172 0.1537633 0.1916683 0.1916683
[[file:images/conv3.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
#+RESULTS:
[[file:images/conv3.png]]
Indeed, it reached the expected value quite quickly. But look at the RMSE
and sdev error metrics compared to what we had before! They are much
increased. The simplest solution to this is using adaptive K: large at the
beginning, to get ballpark estimates for the ranks, and decrease it to refine
the estimates. How well is this going to work?
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv4.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv4.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.244224 2.065307 0.7269138 -5.462297e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.51217 2.081658 1.4065 1.4065
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.05051199 0.03320524 0.03300827 -0.01353418
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.04825602 0.04000318 0.02699239 -0.02699239
[[file:images/conv4.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
#+RESULTS:
[[file:images/conv4.png]]
If you compare the performance of this run to that of the first one (with
non-adaptive K), you see that our average performance value estimate is
greatly improved. The average ordinal rank estimate, however, is improved
only slightly. The conclusion we may draw is that the Elo procedure is
rather robust with respect to estimating correct ordinal rank of an agent,
even if the true performance value is way off.
** Adding a new agent to the corpus
Now let's try to introduce a new agent into the corpus in the middle of
simulation. The quick-and-dirty way to do it is to modify the run.trials
function to re-zero a given agent's rank estimate midway through the
simulation. So, let's run this code:
#+begin_src R :results value silent
run.trials(simulate.new.agent = c(agent.num=1, iteration.num=5000))
#+end_src
#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="sim_new_agent1.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=1,print="TRUE",plot.filename="sim_new_agent1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.244224 2.065307 0.7269138 -5.462297e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.51217 2.081658 1.4065 1.4065
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.05889624 0.03381955 0.04462162 -0.02583388
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.2380088 0.2010747 0.1273665 -0.1273665
[[file:images/sim_new_agent1.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
#+RESULTS:
[[file:images/sim_new_agent1.png]]
Notice that we picked the "outlier" Agent 1, and even though we are using our
adaptive K strategy, by the time the re-zeroing happens, K is already fixed
at 0.01, the lowest value. Nonetheless, it starts approaching the true value
rather rapidly (which isn't too surprising, since the ranking updates of this
agent are rather precise, given that approximate true rank of other agents is
already known). In a "production" algorithm, of course, adaptive K should be
used with any new agents that are introduced.
** Effects of preference sensitivity (S)
*** What is the significance of S anyhow?
Here's one way to find out:
#+NAME: s_significance
#+BEGIN_SRC R :results output
explore.s.wrt.quantiles <- function(dist, s, nq = 20) {
tmp <- rev(quantile(dist,probs=seq(0,1,len=nq+1)))
pairs <- cbind(tmp,c(tmp[2:length(tmp)],NA))
apply(pairs,1,function(x) prob.win.match(x[1],x[2],s=s))
}
rowMeans(replicate(1000,explore.s.wrt.quantiles(rnorm(200,0,5), s = 1, nq = 10)))
rowMeans(replicate(1000,explore.s.wrt.quantiles(rnorm(200,0,1), s = 5, nq = 10)))
rowMeans(replicate(1000,explore.s.wrt.quantiles(runif(200,0,1), s = 5, nq = 10)))
rowMeans(replicate(1000,explore.s.wrt.quantiles(rnorm(200,0,1), s = 1, nq = 10)))
rowMeans(replicate(1000,explore.s.wrt.quantiles(rnorm(200,0,1), s = 15, nq = 10)))
#+END_SRC
#+call: s_significance()
#+results: s_significance()
#+begin_example
100% 90% 80% 70% 60% 50% 40% 30%
0.9976393 0.8882985 0.8244165 0.7875210 0.7742437 0.7765346 0.7906393 0.8230313
20% 10% 0%
0.8931979 0.9974469 NA
100% 90% 80% 70% 60% 50% 40% 30%
0.9975603 0.8888554 0.8225140 0.7891638 0.7789783 0.7748193 0.7884782 0.8223398
20% 10% 0%
0.8892782 0.9977914 NA
100% 90% 80% 70% 60% 50% 40% 30%
0.6203402 0.6201246 0.6206393 0.6212999 0.6211669 0.6202980 0.6215165 0.6215658
20% 10% 0%
0.6212765 0.6217228 NA
100% 90% 80% 70% 60% 50% 40% 30%
0.8062400 0.6064312 0.5774024 0.5672994 0.5618731 0.5624628 0.5667661 0.5773851
20% 10% 0%
0.6059768 0.8063105 NA
100% 90% 80% 70% 60% 50% 40% 30%
0.9999995 0.9968155 0.9860055 0.9764557 0.9705964 0.9707642 0.9766836 0.9864255
20% 10% 0%
0.9965347 0.9999995 NA
#+end_example
Here we took some assumed true performance distribution and look at the
expected result of prob.win.match for agents that are in adjacent deciles. The
outcome depends on the underlying distribution, on the decile considered
(unless the dist'n is uniform), and on the value of S. So, for S=1, for a
normal distribution of attractiveness, if you mach up someone from the 6th
and 5th deciles, the more attractive person will win 56% of the time. For
S=5, the value is 78% of the time. For a uniform distribution w/ S=5, it's
62%.
So what is the more sensible behavior of S for the case of comparing
attractiveness? Well, that's not very clear, but it's a rather interesting
question. Based on my personal guess of how S should behave, I think that
the true distribution of attractiveness is somewhere in between the
normal and the uniform. Also, we can probably conclude that S=1 is too low,
and S=15 is way too high.
Finally, note (cf the first two lines of the output) that manipulating S is
_the same_ as manipulating the variance of the underlying performance
distribution. In other words, in this model, it is this distribution that
determines the effective preference sensitivities. S is just more convenient
to work with (as opposed to regenerating the performance numbers w/ a
different variance).
Let's see what the typical simulation curves look like for S={1,5,15} for the
normal distribution.
*** S = 1:
#+call: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s1-norm.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s1-norm.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
4.333445 3.947019 1.517871 5.906386e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
5.33335 5.128124 1.466375 -1.466375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.0837952 0.06482169 0.04910551 -0.04870657
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.09070457 0.05776304 0.06993688 -0.06993688
[[file:images/s1-norm.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
*** S = 5
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s5-norm.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s5-norm.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.244224 2.065307 0.7269138 -5.462297e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
3.24332 3.156469 0.746375 -0.746375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.05051199 0.03320524 0.03300827 -0.01353418
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.04412241 0.03175029 0.03064038 -0.03064038
[[file:images/s5-norm.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
*** S = 15
#+call: elo_sim(Nagents=200, Nmatches=10000, s=15, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=15, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s15-norm.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s15-norm.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
1.549534 1.421335 0.5199238 -5.32907e-17
RMSE, sdev, abs(mean - true), mean - true: agent 8
1.856845 1.833074 0.296875 -0.296875
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.2281020 0.02855655 0.2249102 0.04291445
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.04505188 0.01655123 0.04190182 0.04190182
[[file:images/s15-norm.png]]
#+end_example
*** Discussion:
Wow, this is pretty interesting. Notice how _ordinal rank_ gets more and
more precise with increasing S, but for S that's too high (e.g. S=15), the
_value rank_ estimate can converge to the wrong value! Notice how average
RMSE is large but average sdev is small: that means that it's oscillating
about something other than the value we expect. This is probably due to the
fact that if the system is overly sensitive, i.e. virtually any advantage
results in an almost sure win, backing out the true performance becomes hard.
** What about a different underlying distribution?
*** Uniform
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="uniform", use.adaptive.k="TRUE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="uniform", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s5.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s5.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
179 187 54 167 120 94 141 25
true.perf[1:8]
0.4148060 0.4370754 -0.2138605 0.3304476 0.1417455 0.01909595 0.2365883 -0.3653334
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
6.638032 6.30027 1.726002 5.417888e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
6.039774 5.850301 1.502375 1.502375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1170928 0.03192453 0.1125994 0.1125994
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.1157774 0.02714986 0.1125495 0.1125495
[[file:images/unif-s5.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
#+call: elo_sim(Nagents=200, Nmatches=10000, s=10, K=0.01, distn="uniform", use.adaptive.k="TRUE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=10, K=0.01, distn="uniform", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s10.png",plot.disp="FALSE")
#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s10.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
179 187 54 167 120 94 141 25
true.perf[1:8]
0.4148060 0.4370754 -0.2138605 0.3304476 0.1417455 0.01909595 0.2365883 -0.3653334
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
4.886145 4.625853 1.313754 -1.154632e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
5.837786 5.317555 2.40975 2.40975
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.03199771 0.02303463 0.02180407 0.02180407
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.03552381 0.02598946 0.02421929 0.02421929
[[file:images/unif-s10.png]]
#+end_example
Setting S=10 for the uniform dist'n results in a preference between adjacent
deciles similar to that of the normal dist'n with S=5. It seems to give a
more precise estimate for the rank value. It's odd, though, that we are
observing a strong bias in the rank value estimation for S that is too low
(S=5, in this case) -- with the uniform dist'n we saw this behavior when S
was too high.
*** Lognormal distribution
Lognormal is skewed; in our situation it corresponds to a hard cut-off of
unattractiveness and a long-ish tail of very attractive people
#+call: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="lognormal", use.adaptive.k="TRUE")
#+results: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="lognormal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="lognorms1.png",plot.disp="FALSE")
#+results: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="lognorms1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
2.407417 -0.9631754 -0.09388662 0.351286 -0.03350121 -0.6323944 3.001919 -0.622024
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
5.292368 4.862564 1.698502 -5.218048e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
6.456528 5.937467 2.53725 -2.53725
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1326077 0.06758075 0.08547004 -0.06061546
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.06518846 0.06219284 0.01954661 -0.01954661
[[file:images/lognorms1.png]]
#+end_example
#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src
We can see that the estimates converge pretty well, and the results do look
like a lognormal distribution:
#+BEGIN_SRC R :eval no-export
par(mfrow=c(2,1))
hist(true.perf,breaks=30)
hist(ranking.evolution[10000,],breaks=30)
#+END_SRC
* The Bradley-Terry R package
** Background
I stumbled upon a very recent (i.e. last version was released < 2 weeks ago)
R package that solves paired comparison problems (also known, apparently, as
Bradley-Terry problems) using MLE model fitting:
http://cran.r-project.org/web/packages/BradleyTerry2/
I'm obviously very curious about the implementation of the MLE approach, and
its advantages/drawbacks compared to the Elo algorithm. I hope to explore
this when I have more time.
For the time being, though, let's quickly modify the code to be compatible
with the BradleyTerry2 package (for this, add a boolean flag to run.trials
that populates the Nagents x Nagents matrix global variable outcomes.matrix;
there, the element ij gives the number of wins of Agent i over Agent j).
We'll compare the results with the Elo answer.
** Re-run the simulation, and run the Bradley-Terry version
#+call: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="bt_comp_norms1.png",plot.disp="FALSE")
#+results: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="bt_comp_norms1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
184 53 131 150 134 92 191 93
true.perf[1:8]
1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
4.333445 3.947019 1.517871 5.906386e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
5.33335 5.128124 1.466375 -1.466375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.07745738 0.0663651 0.0353229 -0.03187495
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.07011404 0.06783307 0.01775465 -0.01775465
[[file:images/bt_comp_norms1.png]]
#+end_example
#+begin_src R :results output silent
run.trials(outcome.mtx.update=TRUE)
#+end_src
#+NAME: bt_package_eval
#+BEGIN_SRC R :var display.hist="FALSE"
#install.packages("BradleyTerry2")
library(BradleyTerry2)
rownames(outcomes.matrix) <- as.character(1:Nagents)
colnames(outcomes.matrix) <- as.character(1:Nagents)
pair.wins.df <- countsToBinomial(outcomes.matrix)
attrModel <- BTm(cbind(win1, win2), player1, player2, ~player, id = "player", data = pair.wins.df)
bt.rankings <- c(0, attrModel$coefficients) # need to de-mean them
# player 1 by convention is assigned rank 0 since only relative rank matters
bt.rankings <- bt.rankings - mean(bt.rankings)
if(display.hist == TRUE) {
oldpar <- par()
par(mfrow=c(3,1))
hist(true.perf,breaks=30)
hist(ranking.evolution[10000,],breaks=30)
hist(bt.rankings,breaks=30)
par(oldpar)
}
bt.ordering <- get.ordering(bt.rankings)
elo.ordering <- ranking.evolution.ordering[10000,]
elo.ordering.mean <- colMeans(ranking.evolution.ordering[7000:10000,])
cat("Errors in the orderings: Elo, Elo mean over 3000 iter, BT\n")
bt.err.cmp.fn <- function(cmp, ann="", true=true.ordering) { cat(ann, rowMeans(mapply(get.error.indiv, cmp, true, MoreArgs=list(0))), "\n") }
invisible(mapply(bt.err.cmp.fn, list(elo.ordering, elo.ordering.mean, bt.ordering)
, list("Elo:", "Elo mean:", "BT:"), MoreArgs=list(true=true.ordering)))
cat("Errors in the rankings: Elo, Elo mean over 3000 iter, BT\n")
invisible(mapply(bt.err.cmp.fn, list(ranking.evolution[10000,],
colMeans(ranking.evolution[7000:10000,]), bt.rankings)
, list("Elo:", "Elo mean:", "BT:"), MoreArgs=list(true=true.perf)))
#+END_SRC
#+call: bt_package_eval[:results output replace](display.hist="FALSE")
Ooh, look at that! (Btw, only the first entry in each line matters). You can
see that B-T algorithm beats even the averaged Elo (the RMSE values are ~ 1,
2, and 3 for the ordinal estimate, and 0.03, 0.03, and 0.05 for the rank
value estimate). The trade-off is that it seems to take a little longer to
run.
* Some conclusions and further questions to ponder
** Conclusions from the simulation
1. Simple Elo algorithm works, and works pretty well under many
circumstances. However, its performance (under the logit pair preference
assumption) seems to depend rather strongly on the variance of the true
underlying distribution (as represented here by the S parameter).
2. R is a very convenient tool or exploring this problem. In particular, the
simulation process adopted here allows very easily to estimate both the rate
of convergence of the estimator to its expected value, its oscillations about
the mean value, and any bias it might have (by comparing RMSE against the
sdev of the estimator over time).
3. Depending the underlying dist'n and its variance, the typical error of the
ordinal rank estimation is on the order of 1-3% (3% corresponding to 6 places
out of 200). The performance value estimation was typically in the range of
0.03 -- 0.08 sigma (the dist'n standard deviation), but could reach high
values, e.g. 0.2 sigma, if estimators didn't converge to the correct mean.
Curiously, even in this case, the ordinal rank estimation error was very low.
4. The only enhancement that is really desirable over the simplest Elo
implementation is the need for adaptive K -- that is, the ranking score
updates by larger increments in the beginning to achieve fast ballpark
estimates, then K is dialed down to help maintain precision.
5. When new agent is introduced into the corpus, it converges to its true
ranking quiet rapidly.
6. It turns out that this class of problems is well known in the statistics
community (e.g. google for Mark Glickman's papers). The BradleyTerry2 R
package provides an MLE-based solution to this problem and outperforms my
Elo implementation in accuracy (but not in speed!).
** If we had a real algorithm running
... then it might provide an source of some pretty interesting data analysis.
What is the actual distribution of attractiveness? How does the decile
sensitivity study above look for the actual data? What is the variance in
people's assessment when you have different people evaluating the same
pictures? How do these answers change depending on the image corpus
(e.g. all girls, all guys, all goats, random images of stuff, animals from
cuteoverload)?
Also, an interesting thing to study would be the effect of deliberately showing
pictures that are close together in ranking vs showing ones that have
disparate rank. Which one would lead to more user engagement? More
accurate set of rankings?
** Final questions to ponder:
How do we election what pictures to show next (or, more generally, how do we
select the next agents for pairwise comparison)?
In this simulation, the strategy was completely random. What do we
gain with more fine-grained control? One possibility is that we can converge
to the true rankings and values quicker if we first only work with a small
subset of the agents, establishing accurate rankings for them, and then
introduce additional agents into the corpus gradually. Indeed, we saw that
in this case the ranking of a new agent takes much less time to establish.
Another interesting question to consider is the fact that each user's
evaluation of attractiveness is going to be different. Can we make things
more precise by clustering the ratings by user, knowing that within each
cluster the rating scale is the same? (Probably not going to make a huge
impact if the number of users is large.)
** Further work
Things that I should be do at some point:
- Look at the R implementation of the Bradley-Terry MLE approach
- Skim through Glickman's papers to (maybe) get a better feel for the
subject
Things that I probably won't get around to exploring, but that I wish I
understood:
- Rank updates of the Elo algorithm generate a random process. It would be
nice to know some of its mathematical characteristics. For instance, can
it be related to other mean-reverting stochastic processes in discrete
time? I've read somewhere that the process is not stationary (and various
statements to the effect that Elo scores drift over time). This is not
what I'm empirically observing in the simulation. Why not, and what
consequences does non-stationarity have?
- Empirically, in my simulations, some underlying probability distributions
result in the estimator converging to incorrect rank values (while the
ordinal rank estimation remains very good). I provided a hand-waving
explanation, but I'm not sure it's correct. It would be nice to know
exactly what's going on.