#+title: Zuckerberg's Facemash rankings: here comes the science
#+author: Leo Alekseyev
#+startup: noindent
#+property: session *R-babel*
#+property: tangle yes

* Introduction
If you've seen the movie /The Social Network/, you know that back when Mark Zuckerberg was just another Harvard undergrad with too much free time and a knack for writing PHP, he became notorious for his personal twist on the /hotornot.com/ concept, called /Facemash/.  The idea was that you compared people not by holding out virtual 10-point scale score cards (the way hotornot.com operated); instead you were shown pictures of two different people, and you selected the one you thought was better looking.  Based on this, the system computed the "hotness" ranking of a person.

In hindsight, we can speculate that aside from being a worthwhile exercise in creating a Web 2.0 service, usage patterns of Facemash gave Mark invaluable insight into the basic sociobiology of the internet.  Indeed, there once was a time when the social activities on the web were mainly restricted to elevated, informed, and sophisticated discussions among like-minded individuals (inasmuch as pornography consumption does not typically count as a social activity).  Of course, as we now understand, coherent discourse is but one of a myriad ways in which we express our inclinations as social creatures, and certainly not the default one.  Edward Wilson, who coined the term /sociobiology/ (incidentally, also at Harvard), popularized the famous example that "an isolated [Rhesus monkey] will repeatedly pull a lever with no reward other than the glimpse of another monkey".  So, while Zuckerberg started out with a drunken thesis that [[http://en.wikipedia.org/wiki/Facemash][Harvard undergrads weren't far removed from farm animals in their looks]], he ended up proving a somewhat related notion: that we all aren't far removed from Wilson's monkeys in our social actions.  Zuckerberg took the lesson to heart, and the rest is history.

Facemash is now a largely forgotten achievement, overshadowed by the Cal. Ave social conglomerate.  This is a shame, because Facemash has rather interesting mathematical underpinnings.  Indeed, we might ask: how does one go from a bunch of photo comparisons to an absolute ranking?  How reliable is the result?  Is there an optimal strategy for selecting which pictures to show, and how does it depend on our objective?  These questions are general and are closely related to the problems of ranking or seeding players in two-player contests, and, in fact, to any setting where multiple binary comparisons are performed.

In what follows, I will arm myself with R, common sense, and a modicum of probability theory and dig into how the ranking algorithm works.  Or, in the immortal words of the XKCD stick figure: Stand back!  I'm going to try science!  Thanks go to Patrick Collison of [[http://stripe.com][stripe.com]] for suggesting the idea!

One last thing, before we begin (if you are reading this online): this is the project's org file.  The org file is, effectively, a live lab notebook of the analysis, with little post-processing of the final results.  No effort will be made to stick to formal scientific writing, to cherry-pick material for the presentation, or to be meticulous about LaTeX formatting.

* Background
We have a bunch of pairwise binary comparisons between N agents, and we want to (at the minimum) estimate their ordinal rank and, possibly, some "total hotness" score.
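To make the setup concrete, here is a minimal sketch (purely illustrative, not part of the simulation code below) of what the raw Facemash-style data might look like; the variable and column names are hypothetical:

#+begin_src R :eval no
## Illustrative only: a hypothetical format for raw pairwise-comparison data.
## Each row is one "match": agents a and b were shown, and _winner_ records
## which of the two was picked.
comparisons <- data.frame(a      = c(17, 42,  3, 17),
                          b      = c( 5, 17, 42,  9),
                          winner = c(17, 42, 42, 17))
head(comparisons)
#+end_src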
The first algorithm that comes to mind is the Elo rating system for chess, since the two problems and underlying assumptions are largely the same.  This idea is not very original and, in fact, according to Wikipedia, this is just what Zuckerberg actually used.  The algorithm also happens to be very simple, so let's implement it and see how well it works!

The basic idea of Elo is the following: the outcome of a match is a random variable, and the probability of one person winning depends on the relative skill difference between him and the opponent.  At the end of the match, each player's rank is changed in proportion to the difference between the expected outcome (conditional on the rankings before the match) and the actual outcome.  If the ranking is correct, then the expected change in rank is zero.  More information can be found on [[http://en.wikipedia.org/wiki/Elo_rating_system][Wikipedia]] and in some papers (this one, for instance): http://www.rcec.nl/publicaties/overige%20publicaties/cito_report.pdf

(As a side note, it seems that a large chunk of the papers on Elo and alternative ranking systems comes from animal researchers studying, e.g., "fecundity and survival rates in relation to social rank... [in] reindeer, clown anemonefish, rabbits, mountain goats, [and] spotted hyaena".  Once again, Zuckerberg's desire to compare his fellow students to farm animals might have been closer to the mark than he realized.)

* The plan
Let's implement Elo and see how it behaves under the following assumptions:
- each person has a "true" performance value (drawn from a distribution, e.g. $\mathcal{N}(0,1)$)
- the probability of winning a match is $1/(1+\exp(-x S))$, where x is the difference in the true performance values, as above.  Depending on whether your background is machine learning, physics, or plain old math, you will recognize this curve as the logistic function, the Fermi function, or $\tfrac{1}{2}(1 + \tanh)$ (up to a rescaling of the argument).  Here, S is the sensitivity parameter; for high S, small differences in true ability result in win probabilities close to 1 (or 0).
- each match is either a win or loss (no draws); you get 1 point for a win, 0 for a loss
- rankings are updated as per the formula $R(i) = R(i-1) + K (W - \mathbb{E}[W|R(i-1)])$.  Here, W is the actual score, and $\mathbb{E}[W|R]$ is the expected score based on the old ranking.  K is a parameter that determines how fast the estimator converges to the true rankings.

Questions we will consider now:
- how well does the algorithm converge to true ordinal rankings?  To true performance values?
- how do the parameters S and K affect accuracy and convergence?
- what difference does the initial distribution make?

Other questions worth examining:
- what about the distribution of winning probabilities (e.g., what if we use the normal CDF instead of the logistic CDF)?
- how well can we estimate the true performance distribution?  How is the distribution different if we are studying chess performance?  Attractiveness?  What about attractiveness of males vs females vs goats?

* The implementation
We will use R to implement the algorithm.  Source code blocks are embedded into this org file using org-babel.  They are, effectively, wrapper functions that can be called from org mode.

Global variables, parameters, and flags:
- =Nagents=: number of agents
- =Nmatches=: actually the number of algorithm iterations; the number of _pairwise_ matches is =Nagents= * =Nmatches=
- =S=, =K=: sensitivity/convergence parameters as described above
- =true.perf=: set of true performance values
- =distn=: underlying "true performance" distribution
- =use.adaptive.k=: whether to ramp K down from a large initial value to the nominal K (see the discussion of adaptive K below)

* Simulate matches and generate ranking estimates
We take an array of agents (1:Nagents) and randomly assign to each a "true performance" value drawn from some distribution using set.true.perf().  To generate a set of random pairings, we match this array against its random permutation.  Note that this leaves a finite probability of an agent being matched with him/herself, but this won't affect the overall conclusions.  We perform =Nmatches= rounds of such permutations.  At each step, all agents are randomly paired with other agents, "fight", and the rank is updated for every agent.

Results of the simulation are maintained in the variables =ranking.evolution= and =ranking.evolution.ordering= (Nmatches+1 x Nagents matrices).  Each row of these matrices can be compared to the =true.perf= and =true.ordering= vectors.  Each column of these matrices gives the evolution of the absolute rank / ordinal rank for a given agent.

#+NAME: elo_sim
#+HEADER: :var Nagents=200 :var Nmatches=10000 :var s=5 :var K=0.01 :var distn="normal" :var use.adaptive.k="FALSE"
#+BEGIN_SRC R :results output silent
set.seed(42)
## true.perf <- rnorm(Nagents,0,1)

## generate the "true performance" distribution
set.true.perf <- function(type = c("normal", "uniform", "u01", "lognormal")) {
  type <- match.arg(type)
  switch(type,
         normal    = rnorm(Nagents, 0, 1),
         uniform   = runif(Nagents, -0.5, 0.5),
         u01       = runif(Nagents, 0, 1),
         lognormal = {tmp <- rlnorm(Nagents, 0, 1); tmp - mean(tmp)})
}
true.perf <- set.true.perf(distn)

## Probability of agent1 winning in a match where the true performance values
## are _perf1_ and _perf2_ (logistic curve)
prob.win.match <- function(perf1, perf2, s = 1) {
  1/(1 + exp((perf2 - perf1)*s))
}

## Probability of winning for a set of agents with true performance values
## given in the array _ranking_ and with the order of opponents given by the
## array _opponents_.  A wrapper over prob.win.match(...).
prob.win <- function(opponents, ranking = true.perf, s = 1) {
  prob.win.match(ranking, ranking[opponents], s)
}

fixed.opponents <- sample.int(Nagents)  # for exploration / debugging
outcomes.matrix <- array(0, c(Nagents, Nagents))

## Simulate random pairings (via sample.int(Nagents)).  Simulate resultant
## match outcomes using a set of Bernoulli trials with probabilities of
## winning given by the logistic curve as per prob.win(...) invoked with true
## performance values.  Update the ranking estimate.
rank.update <- function(oldrank, K, s = 1, fixed.opp = FALSE,
                        outcome.mtx.update = FALSE) {
  if (fixed.opp == FALSE) {
    opponents <- sample.int(Nagents)
  } else {
    opponents <- fixed.opponents   # debugging
  }
  exp.outcome <- prob.win(opponents, ranking = true.perf, s = s)
  assumed.outcome <- prob.win(opponents, ranking = oldrank, s = s)
  trial.outcome <- rbinom(exp.outcome, size = 1, prob = exp.outcome)
  if (outcome.mtx.update == TRUE) {
    for(n in 1:Nagents) {
      outcomes.matrix[n, opponents[n]] <<- outcomes.matrix[n, opponents[n]] +
        trial.outcome[n]
    }
  }
  oldrank + K * (trial.outcome - assumed.outcome)
}
## Get order of vector elements (e.g. [4, 9, 5] gives [1, 3, 2])
get.ordering <- function(relative.vec) {
  match(relative.vec, sort(unique(relative.vec)))
}

gen.adaptive.kvec <- function(kmax, kmin, length, ramp.length) {
  k <- rep(kmin, length)
  k[1:ramp.length] <- seq(kmax, kmin, length = ramp.length)
  k
}

## R is column-major, so want to have longer columns
ranking.evolution <- array(0, c(Nmatches+1, Nagents))

## To simulate the introduction of a new agent into the set and observe how it
## evolves, we will zero out an existing rank for some agent midway through
## the simulation.  To do so, invoke the function as
## run.trials(simulate.new.agent = c(AGENT_NUM, ITERATION_NUM))
run.trials <- function(simulate.new.agent = NULL, outcome.mtx.update = FALSE) {
  if (!is.null(simulate.new.agent)) {
    agent.num <- simulate.new.agent[1]
    iteration.num <- simulate.new.agent[2]
  }
  if (use.adaptive.k == TRUE) {
    kvec <- gen.adaptive.kvec(10 * K, K, Nmatches + 1, 0.15 * Nmatches)
  } else {
    kvec <- gen.adaptive.kvec(K, K, Nmatches + 1, 1)
  }
  for(n in 2:(Nmatches+1)) {
    if (!is.null(simulate.new.agent) && (iteration.num == n)) {
      ranking.evolution[n-1, agent.num] <<- 0
    }
    ranking.evolution[n,] <<- rank.update(ranking.evolution[n-1,], kvec[n], s,
                                          outcome.mtx.update = outcome.mtx.update)
  }
}

run.trials()
ranking.evolution.ordering <- t(apply(ranking.evolution, 1, get.ordering))
true.ordering <- get.ordering(true.perf)
#+END_SRC

* Analysis / assessment / figures of merit
To analyze the performance of the simulation, we use the known true ranking values to compute the root-mean-squared error, the standard deviation of the estimates, and the error of the mean.  We compute these metrics for every agent, for every step in the simulation.  The threshold _thr_ is used to ignore some initial period of time (before the estimators started converging to the true value).  We can then average these metrics over all agents to get some figures of merit for the overall simulation, or examine a particular agent to see, e.g., whether outliers perform anomalously.
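One remark on reading these metrics (my gloss, not in the code): after the burn-in period, the RMSE and the standard deviation together separate bias from noise, since for an estimator oscillating about a mean that is offset from the true value by $b$, $\mathrm{RMSE}^2 \approx b^2 + \mathrm{sdev}^2$.  A minimal sketch with made-up numbers, independent of the simulation:

#+begin_src R :eval no
## Illustrative only: RMSE^2 decomposes (approximately) into bias^2 + variance.
set.seed(1)
true.value <- 0
est  <- rnorm(10000, mean = 0.3, sd = 0.1)   # a biased, noisy "estimate trace"
rmse <- sqrt(mean((est - true.value)^2))
bias <- mean(est) - true.value
c(rmse = rmse, sqrt.bias2.plus.var = sqrt(bias^2 + sd(est)^2))
## the two numbers agree to within sampling error (~0.316)
#+end_src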
#+NAME: elo_sim_analysis
#+HEADER: :var idx=1 :var print="TRUE" :var plot.filename="conv1.png" :var plot.disp="FALSE"
#+BEGIN_SRC R :results output replace
## assess the quality of the ranking:
## predict is a numeric vector; throw out the first thr percent of it,
## then get RMSE and standard deviation for the rest based on the known
## value true (NB: true is a scalar value for one agent)
get.error.indiv <- function(predict, true, thr = 0.15) {
  start <- 1 + ceiling(thr * length(predict))
  predict <- predict[start:length(predict)]
  c(sqrt(mean((predict - true)^2)), sd(predict),
    abs(mean(predict) - true), mean(predict) - true)
}

## est is a matrix with NROW agents, NCOL trials
get.error.aggregate <- function(est, true, thr = 0.15) {
  mapply(get.error.indiv, split(est, col(est)), true, MoreArgs = list(thr))
}

gen.sample.plots <- function(idx) {
  par(mfcol = c(2,1), oma = c(0,0,2,0))
  plot(ranking.evolution.ordering[, idx], col = 1, type = "l")
  abline(h = true.ordering[idx], col = 2)
  title(paste("Ordinal ranking evolution for agent ", idx,
              "; true value is ", true.ordering[idx], sep = ""))
  plot(ranking.evolution[, idx], col = 1, type = "l")
  abline(h = true.perf[idx], col = 2)
  title(paste("Absolute ranking evolution for agent ", idx,
              "; true value is ", format(true.perf[idx], digits = 2), sep = ""))
}

perf.check.ord <- get.error.aggregate(ranking.evolution.ordering, true.ordering, thr = 0.2)
perf.check.val <- get.error.aggregate(ranking.evolution, true.perf, thr = 0.2)

print.perf.check <- function(perf.check, ex.idx = idx) {
  cat("RMSE, sdev, abs(mean - true), mean - true: avg over agents\n")
  cat(rowMeans(perf.check), "\n")
  cat("RMSE, sdev, abs(mean - true), mean - true: agent", ex.idx, "\n")
  cat(perf.check[, ex.idx], "\n")
}

cat.fname.link <- function() {
  cat("[[file:", img.dir, "/", plot.filename, "]]\n", sep = "")
}

if (print == TRUE) {
  cat("true.ordering[1:8]\n ", true.ordering[1:8], "\n")
  cat("true.perf[1:8]\n ", true.perf[1:8], "\n")
  cat("Performance check: ordinal rank estimate\n")
  print.perf.check(perf.check.ord, idx)
  cat("Performance check: underlying value estimate\n")
  print.perf.check(perf.check.val, idx)
}

if (plot.disp == TRUE) { X11(); gen.sample.plots(idx) }

img.dir = "images"
if (!file.exists(img.dir)) { dir.create(img.dir) }
png(file = paste(img.dir, "/", plot.filename, sep = ""))
gen.sample.plots(idx)
invisible(dev.off())
cat("[[file:", img.dir, "/", plot.filename, "]]\n", sep = "")
#+END_SRC

#+RESULTS: elo_sim_analysis
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.380491 2.201938 0.7447513 5.817569e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.618611 2.184542 1.444125 1.444125
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1281709 0.06441655 0.1059347 0.02475723
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.1899602 0.1097201 0.1550738 -0.1550738
[[file:images/conv1.png]]
#+end_example

* Time for experimentation!
** Rate of convergence, outliers, and the need for adaptive K
## org-babel tip: to run the source block, C-c C-c w/ point on the #+call line
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="FALSE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="FALSE")

Let's see what we got:

#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv1.png",plot.disp="FALSE")

#+RESULTS: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.380491 2.201938 0.7447513 5.817569e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.618611 2.184542 1.444125 1.444125
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1281709 0.06441655 0.1059347 0.02475723
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.1899602 0.1097201 0.1550738 -0.1550738
[[file:images/conv1.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

#+RESULTS:
[[file:images/conv1.png]]

What do the diagnostics tell us?  On average, the rankings are correct to within +/- 2 places in a field of 200.  The RMSE for agent 1 (especially for the value estimate) is much higher than the standard deviation, suggesting it has problems converging to the correct number.  If we look at the plot, we see that this is in fact the case.  Notice that Agent 1 is in roughly the 92nd percentile by rank (ordinal rank 184 out of 200), and that its true performance is about $1.4\sigma$.
## To see the percentiles, C-c C-c the code in the sketch below (after the
## Agent 8 comparison) and look at the output in the session buffer
## (*R-babel*) or the *Messages* buffer

Now let's look at agent 8, which is closer to the 45th percentile.  Does it converge faster?  Yes:

#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="conv2.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="conv2.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.380491 2.201938 0.7447513 5.817569e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
3.302878 3.227995 0.70025 -0.70025
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1281709 0.06441655 0.1059347 0.02475723
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.03239877 0.02879233 0.01485887 0.01485887
[[file:images/conv2.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

#+RESULTS:
[[file:images/conv2.png]]

Notice how the RMSE and sdev metrics for Agent 8 are of the same magnitude and the ranking value appears to oscillate around its mean fairly early on.
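Here is the quick percentile check referenced above -- a minimal sketch that simply reads off where agents 1 and 8 fall in the true ordering (=true.ordering=, =true.perf=, and =Nagents= are defined in the elo_sim block):

#+begin_src R :results output
## Empirical percentile of agents 1 and 8, from their true ordinal ranks
cat("Agent 1 percentile:", 100 * true.ordering[1] / Nagents, "\n")
cat("Agent 8 percentile:", 100 * true.ordering[8] / Nagents, "\n")
## For the normal performance distribution, pnorm(true.perf[idx]) gives the
## same information in terms of the underlying value
cat("pnorm of agent 1's true performance:", pnorm(true.perf[1]), "\n")
#+end_src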
Hypothesis: increasing K will help the "outlier" rank estimate converge more quickly.  Rerun the simulation:

#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.1, distn="normal", use.adaptive.k="FALSE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.1, distn="normal", use.adaptive.k="FALSE")

And look at the new plot:

#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv3.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv3.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
6.171757 5.975447 1.232069 0
RMSE, sdev, abs(mean - true), mean - true: agent 1
4.672165 4.625406 0.661375 0.661375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1586351 0.1252956 0.0856625 0.05684161
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.2457172 0.1537633 0.1916683 0.1916683
[[file:images/conv3.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

#+RESULTS:
[[file:images/conv3.png]]

Indeed, the estimate reached the expected value quite quickly.  But look at the RMSE and sdev error metrics compared to what we had before!  They have increased considerably.  The simplest solution is an adaptive K: keep it large at the beginning, to get ballpark estimates for the ranks, then decrease it to refine the estimates.  How well is this going to work?

#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv4.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=1,print="TRUE",plot.filename="conv4.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.244224 2.065307 0.7269138 -5.462297e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.51217 2.081658 1.4065 1.4065
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.05051199 0.03320524 0.03300827 -0.01353418
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.04825602 0.04000318 0.02699239 -0.02699239
[[file:images/conv4.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

#+RESULTS:
[[file:images/conv4.png]]

If you compare the performance of this run to that of the first one (with non-adaptive K), you see that the average performance value estimate is greatly improved.  The average ordinal rank estimate, however, is improved only slightly.  The conclusion we may draw is that the Elo procedure is rather robust with respect to estimating the correct ordinal rank of an agent, even if the estimated performance value is way off.

** Adding a new agent to the corpus
Now let's try to introduce a new agent into the corpus in the middle of the simulation.  The quick-and-dirty way to do it is to modify the run.trials function to re-zero a given agent's rank estimate midway through the simulation.
So, let's run this code:

#+begin_src R :results value silent
run.trials(simulate.new.agent = c(agent.num=1, iteration.num=5000))
#+end_src

#+call: elo_sim_analysis(idx=1,print="TRUE",plot.filename="sim_new_agent1.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=1,print="TRUE",plot.filename="sim_new_agent1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.244224 2.065307 0.7269138 -5.462297e-16
RMSE, sdev, abs(mean - true), mean - true: agent 1
2.51217 2.081658 1.4065 1.4065
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.05889624 0.03381955 0.04462162 -0.02583388
RMSE, sdev, abs(mean - true), mean - true: agent 1
0.2380088 0.2010747 0.1273665 -0.1273665
[[file:images/sim_new_agent1.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

#+RESULTS:
[[file:images/sim_new_agent1.png]]

Notice that we picked the "outlier" Agent 1, and even though we are using our adaptive K strategy, by the time the re-zeroing happens, K is already fixed at 0.01, the lowest value.  Nonetheless, the estimate starts approaching the true value rather rapidly (which isn't too surprising, since the ranking updates for this agent are rather precise, given that the approximate true ranks of the other agents are already known).  In a "production" algorithm, of course, adaptive K should be used with any new agents that are introduced.

** Effects of preference sensitivity (S)
*** What is the significance of S anyhow?
Here's one way to find out:

#+NAME: s_significance
#+BEGIN_SRC R :results output
explore.s.wrt.quantiles <- function(dist, s, nq = 20) {
  tmp <- rev(quantile(dist, probs = seq(0, 1, len = nq+1)))
  pairs <- cbind(tmp, c(tmp[2:length(tmp)], NA))
  apply(pairs, 1, function(x) prob.win.match(x[1], x[2], s = s))
}
rowMeans(replicate(1000, explore.s.wrt.quantiles(rnorm(200,0,5), s = 1,  nq = 10)))
rowMeans(replicate(1000, explore.s.wrt.quantiles(rnorm(200,0,1), s = 5,  nq = 10)))
rowMeans(replicate(1000, explore.s.wrt.quantiles(runif(200,0,1), s = 5,  nq = 10)))
rowMeans(replicate(1000, explore.s.wrt.quantiles(rnorm(200,0,1), s = 1,  nq = 10)))
rowMeans(replicate(1000, explore.s.wrt.quantiles(rnorm(200,0,1), s = 15, nq = 10)))
#+END_SRC

#+call: s_significance()

#+results: s_significance()
#+begin_example
     100%       90%       80%       70%       60%       50%       40%       30%
0.9976393 0.8882985 0.8244165 0.7875210 0.7742437 0.7765346 0.7906393 0.8230313
      20%       10%        0%
0.8931979 0.9974469        NA
     100%       90%       80%       70%       60%       50%       40%       30%
0.9975603 0.8888554 0.8225140 0.7891638 0.7789783 0.7748193 0.7884782 0.8223398
      20%       10%        0%
0.8892782 0.9977914        NA
     100%       90%       80%       70%       60%       50%       40%       30%
0.6203402 0.6201246 0.6206393 0.6212999 0.6211669 0.6202980 0.6215165 0.6215658
      20%       10%        0%
0.6212765 0.6217228        NA
     100%       90%       80%       70%       60%       50%       40%       30%
0.8062400 0.6064312 0.5774024 0.5672994 0.5618731 0.5624628 0.5667661 0.5773851
      20%       10%        0%
0.6059768 0.8063105        NA
     100%       90%       80%       70%       60%       50%       40%       30%
0.9999995 0.9968155 0.9860055 0.9764557 0.9705964 0.9707642 0.9766836 0.9864255
      20%       10%        0%
0.9965347 0.9999995        NA
#+end_example

Here we take some assumed true performance distribution and look at the expected result of prob.win.match for agents that are in adjacent deciles.  The outcome depends on the underlying distribution, on the decile considered (unless the dist'n is uniform), and on the value of S.
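As a quick spot-check on those numbers (a sketch, separate from the block above, assuming prob.win.match from the simulation block is already defined in the session): plugging the theoretical 60% and 50% quantiles of $\mathcal{N}(0,1)$ directly into prob.win.match reproduces the adjacent-decile probabilities quoted in the next paragraph for S=1 and S=5.

#+begin_src R :results output
## Spot-check: win probability across one decile boundary of N(0,1),
## using theoretical quantiles instead of sampled ones
cat(prob.win.match(qnorm(0.6), qnorm(0.5), s = 1), "\n")   # ~0.56
cat(prob.win.match(qnorm(0.6), qnorm(0.5), s = 5), "\n")   # ~0.78
#+end_src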
So, for S=1, for a normal distribution of attractiveness, if you match up someone from the 6th and 5th deciles, the more attractive person will win 56% of the time.  For S=5, the value is 78%.  For a uniform distribution w/ S=5, it's 62%.

So what is the more sensible behavior of S for the case of comparing attractiveness?  Well, that's not very clear, but it's a rather interesting question.  Based on my personal guess of how S should behave, I think that the true distribution of attractiveness is somewhere in between the normal and the uniform.  Also, we can probably conclude that S=1 is too low, and S=15 is way too high.

Finally, note (cf. the first two result vectors in the output) that manipulating S is _the same_ as manipulating the variance of the underlying performance distribution.  In other words, in this model, it is this distribution that determines the effective preference sensitivity.  S is just more convenient to work with (as opposed to regenerating the performance numbers w/ a different variance).

Let's see what the typical simulation curves look like for S={1,5,15} for the normal distribution.

*** S = 1:
#+call: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s1-norm.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s1-norm.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
4.333445 3.947019 1.517871 5.906386e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
5.33335 5.128124 1.466375 -1.466375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.0837952 0.06482169 0.04910551 -0.04870657
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.09070457 0.05776304 0.06993688 -0.06993688
[[file:images/s1-norm.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

*** S = 5
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s5-norm.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s5-norm.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
2.244224 2.065307 0.7269138 -5.462297e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
3.24332 3.156469 0.746375 -0.746375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.05051199 0.03320524 0.03300827 -0.01353418
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.04412241 0.03175029 0.03064038 -0.03064038
[[file:images/s5-norm.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

*** S = 15
#+call: elo_sim(Nagents=200, Nmatches=10000, s=15, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=15, K=0.01, distn="normal", use.adaptive.k="TRUE")
#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s15-norm.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="s15-norm.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
1.549534 1.421335 0.5199238 -5.32907e-17
RMSE, sdev, abs(mean - true), mean - true: agent 8
1.856845 1.833074 0.296875 -0.296875
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.2281020 0.02855655 0.2249102 0.04291445
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.04505188 0.01655123 0.04190182 0.04190182
[[file:images/s15-norm.png]]
#+end_example

*** Discussion:
Wow, this is pretty interesting.  Notice how the _ordinal rank_ gets more and more precise with increasing S, but for S that's too high (e.g. S=15), the _rank value_ estimate can converge to the wrong value!  Notice how the average RMSE is large but the average sdev is small: that means the estimate is oscillating about something other than the value we expect.  This is probably due to the fact that if the system is overly sensitive, i.e. virtually any advantage results in an almost sure win, backing out the true performance becomes hard.

** What about a different underlying distribution?
*** Uniform
#+call: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="uniform", use.adaptive.k="TRUE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=5, K=0.01, distn="uniform", use.adaptive.k="TRUE")

#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s5.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s5.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  179 187 54 167 120 94 141 25
true.perf[1:8]
  0.4148060 0.4370754 -0.2138605 0.3304476 0.1417455 0.01909595 0.2365883 -0.3653334
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
6.638032 6.30027 1.726002 5.417888e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
6.039774 5.850301 1.502375 1.502375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1170928 0.03192453 0.1125994 0.1125994
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.1157774 0.02714986 0.1125495 0.1125495
[[file:images/unif-s5.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

#+call: elo_sim(Nagents=200, Nmatches=10000, s=10, K=0.01, distn="uniform", use.adaptive.k="TRUE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=10, K=0.01, distn="uniform", use.adaptive.k="TRUE")

#+call: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s10.png",plot.disp="FALSE")

#+results: elo_sim_analysis(idx=8,print="TRUE",plot.filename="unif-s10.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  179 187 54 167 120 94 141 25
true.perf[1:8]
  0.4148060 0.4370754 -0.2138605 0.3304476 0.1417455 0.01909595 0.2365883 -0.3653334
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
4.886145 4.625853 1.313754 -1.154632e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
5.837786 5.317555 2.40975 2.40975
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.03199771 0.02303463 0.02180407 0.02180407
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.03552381 0.02598946 0.02421929 0.02421929
[[file:images/unif-s10.png]]
#+end_example

Setting S=10 for the uniform dist'n results in a preference between adjacent deciles similar to that of the normal dist'n with S=5.  It seems to give a more precise estimate for the rank value.  It's odd, though, that we are observing a strong bias in the rank value estimation for S that is too low (S=5, in this case) -- with the normal dist'n we saw this behavior when S was too high.

*** Lognormal distribution
Lognormal is skewed; in our situation it corresponds to a hard cut-off of unattractiveness and a long-ish tail of very attractive people.

#+call: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="lognormal", use.adaptive.k="TRUE")

#+results: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="lognormal", use.adaptive.k="TRUE")

#+call: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="lognorms1.png",plot.disp="FALSE")

#+results: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="lognorms1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  2.407417 -0.9631754 -0.09388662 0.351286 -0.03350121 -0.6323944 3.001919 -0.622024
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
5.292368 4.862564 1.698502 -5.218048e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
6.456528 5.937467 2.53725 -2.53725
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.1326077 0.06758075 0.08547004 -0.06061546
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.06518846 0.06219284 0.01954661 -0.01954661
[[file:images/lognorms1.png]]
#+end_example

#+begin_src R :results output raw :exports results
cat.fname.link()
#+end_src

We can see that the estimates converge pretty well, and the results do look like a lognormal distribution:

#+BEGIN_SRC R :eval no-export
par(mfrow=c(2,1))
hist(true.perf, breaks = 30)
hist(ranking.evolution[10000,], breaks = 30)
#+END_SRC

* The Bradley-Terry R package
** Background
I stumbled upon a very recent (i.e. the last version was released < 2 weeks ago) R package that solves paired comparison problems (also known, apparently, as Bradley-Terry problems) using MLE model fitting: http://cran.r-project.org/web/packages/BradleyTerry2/

I'm obviously very curious about the implementation of the MLE approach, and its advantages/drawbacks compared to the Elo algorithm.  I hope to explore this when I have more time.  For the time being, though, let's quickly modify the code to be compatible with the BradleyTerry2 package (for this, add a boolean flag to run.trials that populates the Nagents x Nagents global matrix outcomes.matrix; there, the element ij gives the number of wins of Agent i over Agent j).  We'll compare the results with the Elo answer.
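For reference (my own summary of the model, not taken from the package documentation): the basic Bradley-Terry model assigns each agent a positive "ability" $\pi_i$ and models each comparison as

\[ P(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j} = \frac{1}{1 + e^{-(\lambda_i - \lambda_j)}}, \qquad \lambda_i = \log \pi_i, \]

i.e. exactly the logistic-in-the-difference model used in the simulation (with S=1 and $\lambda$ playing the role of the true performance value), so fitting it by maximum likelihood to outcomes.matrix is a fair apples-to-apples comparison with Elo.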
** Re-run the simulation, and run the Bradley-Terry version
#+call: elo_sim(Nagents=200, Nmatches=10000, s=1, K=0.01, distn="normal", use.adaptive.k="TRUE")

#+call: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="bt_comp_norms1.png",plot.disp="FALSE")

#+results: elo_sim_analysis[:results output replace](idx=8,print="TRUE",plot.filename="bt_comp_norms1.png",plot.disp="FALSE")
#+begin_example
true.ordering[1:8]
  184 53 131 150 134 92 191 93
true.perf[1:8]
  1.370958 -0.5646982 0.3631284 0.6328626 0.4042683 -0.1061245 1.511522 -0.09465904
Performance check: ordinal rank estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
4.333445 3.947019 1.517871 5.906386e-16
RMSE, sdev, abs(mean - true), mean - true: agent 8
5.33335 5.128124 1.466375 -1.466375
Performance check: underlying value estimate
RMSE, sdev, abs(mean - true), mean - true: avg over agents
0.07745738 0.0663651 0.0353229 -0.03187495
RMSE, sdev, abs(mean - true), mean - true: agent 8
0.07011404 0.06783307 0.01775465 -0.01775465
[[file:images/bt_comp_norms1.png]]
#+end_example

#+begin_src R :results output silent
run.trials(outcome.mtx.update=TRUE)
#+end_src

#+NAME: bt_package_eval
#+BEGIN_SRC R :var display.hist="FALSE"
#install.packages("BradleyTerry2")
library(BradleyTerry2)
rownames(outcomes.matrix) <- as.character(1:Nagents)
colnames(outcomes.matrix) <- as.character(1:Nagents)
pair.wins.df <- countsToBinomial(outcomes.matrix)
attrModel <- BTm(cbind(win1, win2), player1, player2, ~player, id = "player",
                 data = pair.wins.df)
bt.rankings <- c(0, attrModel$coefficients)
## need to de-mean them:
## player 1 by convention is assigned rank 0 since only relative rank matters
bt.rankings <- bt.rankings - mean(bt.rankings)
if(display.hist == TRUE) {
  oldpar <- par()
  par(mfrow=c(3,1))
  hist(true.perf, breaks = 30)
  hist(ranking.evolution[10000,], breaks = 30)
  hist(bt.rankings, breaks = 30)
  par(oldpar)
}
bt.ordering <- get.ordering(bt.rankings)
elo.ordering <- ranking.evolution.ordering[10000,]
elo.ordering.mean <- colMeans(ranking.evolution.ordering[7000:10000,])
cat("Errors in the orderings: Elo, Elo mean over 3000 iter, BT\n")
bt.err.cmp.fn <- function(cmp, ann = "", true = true.ordering) {
  cat(ann, rowMeans(mapply(get.error.indiv, cmp, true, MoreArgs = list(0))), "\n")
}
invisible(mapply(bt.err.cmp.fn,
                 list(elo.ordering, elo.ordering.mean, bt.ordering),
                 list("Elo:", "Elo mean:", "BT:"),
                 MoreArgs = list(true = true.ordering)))
cat("Errors in the rankings: Elo, Elo mean over 3000 iter, BT\n")
invisible(mapply(bt.err.cmp.fn,
                 list(ranking.evolution[10000,],
                      colMeans(ranking.evolution[7000:10000,]),
                      bt.rankings),
                 list("Elo:", "Elo mean:", "BT:"),
                 MoreArgs = list(true = true.perf)))
#+END_SRC

#+call: bt_package_eval[:results output replace](display.hist="FALSE")

Ooh, look at that!  (Btw, only the first entry in each line matters.)  You can see that the B-T algorithm beats even the averaged Elo (the RMSE values are ~ 1, 2, and 3 for the ordinal estimate, and 0.03, 0.03, and 0.05 for the rank value estimate).  The trade-off is that it seems to take a little longer to run.

* Some conclusions and further questions to ponder
** Conclusions from the simulation
1. The simple Elo algorithm works, and works pretty well under many circumstances.  However, its performance (under the logit pair preference assumption) seems to depend rather strongly on the variance of the true underlying distribution (as represented here by the S parameter).
2. R is a very convenient tool for exploring this problem.
   In particular, the simulation process adopted here makes it very easy to estimate the rate of convergence of the estimator to its expected value, its oscillations about the mean value, and any bias it might have (by comparing the RMSE against the sdev of the estimator over time).
3. Depending on the underlying dist'n and its variance, the typical error of the ordinal rank estimation is on the order of 1-3% (3% corresponding to 6 places out of 200).  The performance value estimation error was typically in the range of 0.03 -- 0.08 $\sigma$ (the dist'n standard deviation), but could reach high values, e.g. 0.2 $\sigma$, if the estimators didn't converge to the correct mean.  Curiously, even in this case, the ordinal rank estimation error was very low.
4. The only enhancement that is really needed over the simplest Elo implementation is adaptive K -- that is, the ranking score updates by larger increments in the beginning to achieve fast ballpark estimates, then K is dialed down to help maintain precision.
5. When a new agent is introduced into the corpus, its estimate converges to the true ranking quite rapidly.
6. It turns out that this class of problems is well known in the statistics community (e.g. google for Mark Glickman's papers).  The BradleyTerry2 R package provides an MLE-based solution to this problem and outperforms my Elo implementation in accuracy (but not in speed!).

** If we had a real algorithm running
... then it might provide a source of some pretty interesting data analysis.  What is the actual distribution of attractiveness?  How does the decile sensitivity study above look for the actual data?  What is the variance in people's assessments when you have different people evaluating the same pictures?  How do these answers change depending on the image corpus (e.g. all girls, all guys, all goats, random images of stuff, animals from cuteoverload)?  Also, an interesting thing to study would be the effect of deliberately showing pictures that are close together in ranking vs showing ones that have disparate ranks.  Which one would lead to more user engagement?  A more accurate set of rankings?

** Final questions to ponder:
How do we select which pictures to show next (or, more generally, how do we select the next agents for pairwise comparison)?  In this simulation, the strategy was completely random.  What do we gain with more fine-grained control?  One possibility is that we can converge to the true rankings and values more quickly if we first only work with a small subset of the agents, establishing accurate rankings for them, and then introduce additional agents into the corpus gradually.  Indeed, we saw that in this case the ranking of a new agent takes much less time to establish.

Another interesting question to consider is the fact that each user's evaluation of attractiveness is going to be different.  Can we make things more precise by clustering the ratings by user, knowing that within each cluster the rating scale is the same?  (Probably not going to make a huge impact if the number of users is large.)

** Further work
Things that I should do at some point:
- Look at the R implementation of the Bradley-Terry MLE approach
- Skim through Glickman's papers to (maybe) get a better feel for the subject

Things that I probably won't get around to exploring, but that I wish I understood:
- Rank updates of the Elo algorithm generate a random process.  It would be nice to know some of its mathematical characteristics.
  For instance, can it be related to other mean-reverting stochastic processes in discrete time?  I've read somewhere that the process is not stationary (and various statements to the effect that Elo scores drift over time).  This is not what I'm empirically observing in the simulation.  Why not, and what consequences does non-stationarity have?
- Empirically, in my simulations, some underlying probability distributions result in the estimator converging to incorrect rank values (while the ordinal rank estimation remains very good).  I provided a hand-waving explanation, but I'm not sure it's correct.  It would be nice to know exactly what's going on.
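As a first stab at the stationarity question, here is a minimal sketch of the kind of drift check behind the "not what I'm empirically observing" remark, assuming the ranking.evolution matrix from the last simulation run is still in the session: compare per-agent mean estimates in an early post-burn-in window against a late window; systematic drift would show up as a nonzero signed change.

#+begin_src R :results output
## Crude drift check: per-agent mean rank estimate in an early vs a late
## window, both taken well after the burn-in period
early <- colMeans(ranking.evolution[3001:5000, ])
late  <- colMeans(ranking.evolution[8001:10000, ])
cat("mean absolute change between windows:", mean(abs(late - early)), "\n")
cat("mean signed change (drift):", mean(late - early), "\n")
#+end_src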