How To Sort By Average Rating

Evan Miller’s well-known How Not To Sort By Average Rating points out problems with ranking by “wrong solution #1” (by differences, upvotes minus downvotes) and “wrong solution #2” (by average ratings, upvotes divided by total votes). Miller’s “correct solution” is to use the lower bound of a Wilson score confidence interval for a Bernoulli parameter. I think it would probably be better to use Laplace smoothing, because:

  • Laplace smoothing is much easier
  • Laplace smoothing is not always negatively biased

This is the Wilson scoring formula given in Miller’s post, which we’ll use to get 95% confidence interval lower bounds:

\displaystyle \left( \hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{ \frac{\hat{p}(1-\hat{p}) + z_{\alpha/2}^2/4n}{n} } \right) \Bigg/ \left( 1 + \frac{z_{\alpha/2}^2}{n} \right)

(Use minus where it says plus/minus to calculate the lower bound.) Here \hat{p} is the observed fraction of positive ratings, z_{\alpha/2} is the (1-\alpha/2) quantile of the standard normal distribution, and n is the total number of ratings.
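The lower bound is straightforward to compute directly. Here's a minimal Python sketch of Miller's formula (the function name is mine), with z = 1.96 for a 95% interval:

```python
import math

def wilson_lower(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the fraction of
    positive ratings; z = 1.96 corresponds to a 95% confidence level."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    phat = upvotes / n
    z2 = z * z
    center = phat + z2 / (2 * n)
    # Equivalent to sqrt(phat*(1-phat)/n + z^2/(4*n^2)):
    margin = z * math.sqrt((phat * (1 - phat) + z2 / (4 * n)) / n)
    return (center - margin) / (1 + z2 / n)
```

For example, wilson_lower(209, 50) comes out to about 0.7545, matching the tables later in the post.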

Now here’s the formula for doing Laplace smoothing instead:

\displaystyle \frac{\text{upvotes} + \alpha}{\text{total votes} + \beta}

Here \alpha and \beta are parameters that represent our estimate of what rating is probably appropriate when we know nothing else (cf. a Bayesian prior). For example, \alpha = 1 and \beta = 2 means that a post with no votes gets treated as a 0.5.

The Laplace smoothing method is much simpler to calculate – there’s no need for statistical libraries, or even square roots!
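For comparison, the whole Laplace calculation fits in a line of Python (names are mine):

```python
def laplace_score(upvotes, total_votes, alpha=1, beta=2):
    """Laplace-smoothed rating; alpha=1, beta=2 pulls scores toward 0.5."""
    return (upvotes + alpha) / (total_votes + beta)

print(laplace_score(0, 0))      # 0.5: a post with no votes is treated as a 0.5
print(laplace_score(209, 259))  # about 0.80
```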

Does it successfully solve the problems of “wrong solution #1” and “wrong solution #2”? First, the problem with “wrong solution #1”, which we might summarize as “the problem with large sample sizes”:

              upvotes  downvotes  wrong #1  wrong #2  Wilson  Laplace
first item        209         50       159      0.81  0.7545     0.80
second item       118         25        93      0.83  0.7546     0.82

All the methods except “wrong solution #1” agree that the second item should rank higher.

Then there’s the problem with “wrong solution #2”, which we might summarize as “the problem with small sample sizes”:

              upvotes  downvotes  wrong #1  wrong #2  Wilson  Laplace
first item          1          0         1      1.00    0.21     0.67
second item       534         46       488      0.92    0.90     0.92

All the methods except “wrong solution #2” agree that the second item should rank higher.

How similar are the results for the Wilson method and the Laplace method overall? Take a look: here color encodes the score, with blue at 0.0, white at 0.5, and red at 1.0:

plot of Wilson and Laplace methods

They’re so similar, you might say, that you would need a very good reason to justify the complexity of the calculation for the Wilson bound. But also, the differences favor the Laplace method! The Wilson method, because it’s a lower bound, is negatively biased everywhere. It’s certainly not symmetrical. Let’s zoom in:

plot of Wilson and Laplace methods - zoomed

With the Wilson method, you could have three upvotes, no downvotes and still rank lower than an item that is disliked by 50% of people over the long run. That seems strange.
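That's easy to confirm with the Wilson lower bound (z = 1.96); a quick sketch:

```python
import math

def wilson_lower(upvotes, downvotes, z=1.96):
    # Lower bound of the Wilson score interval, as in Miller's post.
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    phat = upvotes / n
    z2 = z * z
    margin = z * math.sqrt((phat * (1 - phat) + z2 / (4 * n)) / n)
    return (phat + z2 / (2 * n) - margin) / (1 + z2 / n)

liked = wilson_lower(3, 0)          # three upvotes, no downvotes: about 0.44
mediocre = wilson_lower(5000, 5000) # 50% approval over many, many votes
print(liked < mediocre)             # True: the unanimously liked item ranks lower
```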

The Laplace method does have its own biases. By choosing \alpha=1 and \beta=2, the bias is toward 0.5, which I think is reasonable for a ranking problem like this. But you could change it: \alpha=0 with \beta=1 biases toward zero, \alpha=1 with \beta=0 biases toward one. And \alpha=100 with \beta=200 biases toward 0.5 very strongly. With the Wilson method you can tune the size of the interval, adjusting the confidence level, but this only adjusts how strongly you’re biased toward zero.

Here’s another way of looking at the comparison. How do the two methods compare for varying numbers of upvotes with a constant number (10) of downvotes?

Wilson and Laplace methods again

Those are similar curves. Not identical – but is there a difference to justify the complexity of the Wilson score?

In conclusion: Just adding a little bit to your numerators and denominators (Laplace smoothing) gives you a scoring system that is as good as or better than calculating Wilson scores.

[code for this post]

Bayes’ Rule for Ducks

You look at a thing.

duck?

Is it a duck?

Re-phrase: What is the probability that it’s a duck, if it looks like that?

Bayes’ rule says that the probability of it being a duck, if it looks like that, is the same as the probability of any old thing being a duck, times the probability of a duck looking like that, divided by the probability of a thing looking like that.

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

This makes sense:

  • If ducks are mythical beasts, then Pr(duck) (our “prior” on ducks) is very low, and the thing would have to be very duck-like before we’d believe it’s a duck. On the other hand, if we’re at some sort of duck farm, then Pr(duck) is high and anything that looks even a little like a duck is probably a duck.
  • If it’s very likely that a duck would look like that (Pr(looks|duck) is high) then we’re more likely to think it’s a duck. This is the “likelihood” of a duck looking like that thing. In practice it’s based on how the ducks we’ve seen before have looked.
  • The denominator Pr(looks) normalizes things. After all, we’re in some sense portioning out the probabilities of this thing being whatever it could be. If 1% of things look like this, and 1% of things look like this and are ducks, then 100% of things that look like this are ducks. So Pr(looks) is what we’re working with; it’s the denominator.

Here’s an example of a strange world to test this in:

ducks

There are ten things. Six of them are ducks. Five of them look like ducks. Four of them both look like ducks and are ducks. One thing looks like a duck but is not a duck. Maybe it’s a fake duck? Two ducks do not look like ducks. Ducks in camouflage. Test the equality of the two sides of Bayes’ rule:

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

\displaystyle \frac{4}{5} = \frac{\frac{6}{10} \cdot \frac{4}{6}}{\frac{5}{10}}
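Those fractions can also be checked mechanically; a small sketch with exact arithmetic:

```python
from fractions import Fraction

# The strange world: ten things, six ducks, five duck-looking things,
# and four things that both are ducks and look like ducks.
pr_duck = Fraction(6, 10)
pr_looks = Fraction(5, 10)
pr_both = Fraction(4, 10)

pr_duck_given_looks = pr_both / pr_looks  # Pr(duck | looks) = 4/5
pr_looks_given_duck = pr_both / pr_duck   # Pr(looks | duck) = 4/6

# Bayes' rule: Pr(duck | looks) = Pr(duck) * Pr(looks | duck) / Pr(looks)
rhs = pr_duck * pr_looks_given_duck / pr_looks
print(pr_duck_given_looks == rhs)  # True
```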

It’s true here, and it’s not hard to show that it must be true, using two ways of expressing the probability of being a duck and looking like a duck. We have both of these:

\displaystyle Pr(duck \cap looks) = Pr(duck|looks) \cdot Pr(looks)

\displaystyle Pr(duck \cap looks) = Pr(looks|duck) \cdot Pr(duck)

Check those with the example as well, if you like. Using the equality, we get:

\displaystyle Pr(duck|looks) \cdot Pr(looks) = Pr(looks|duck) \cdot Pr(duck)

Then dividing by Pr(looks) we have Bayes’ rule, as above.

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

This is not a difficult proof at all, but for many people the result feels very unintuitive. I’ve tried to explain it once before in the context of statistical claims. Of course there’s a Wikipedia page and many other resources. I wanted to try to do it with a simple, unifying example that makes the equations easy to parse, and this is what I’ve come up with.

What’s the difference between Bayesian and non-Bayesian statistics?

A coin is flipped and comes up heads five times in a row. Is it a fair coin?

Whether you trust a coin to come up heads 50% of the time depends a good deal on who’s flipping the coin. If you’re flipping your own quarter at home, five heads in a row will almost certainly not lead you to suspect wrongdoing. At a magic show or gambling with a shady character on a street corner, you might quickly doubt the balance of the coin or the flipping mechanism.

What is often meant by non-Bayesian “classical statistics” or “frequentist statistics” is “hypothesis testing”: you state a belief about the world, determine how likely you are to see what you saw if that belief is true, and if what you saw was a very rare thing to see then you say that you don’t believe the original belief. That original belief about the world is often called the “null hypothesis”.

Our null hypothesis for the coin is that it is fair – heads and tails both come up 50% of the time. If that’s true, you get five heads in a row 1 in 32 times. That’s 3.125% of the time, or just 0.03125, and this sort of probability is sometimes called a “p-value”. If the value is very small, the data you observed was not a likely thing to see, and you’ll “reject the null hypothesis”. The cutoff for smallness is often 0.05. So the frequentist statistician says that it’s very unlikely to see five heads in a row if the coin is fair, so we don’t believe it’s a fair coin – whether we’re flipping nickels at the national reserve or betting a stranger at the bar.
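The arithmetic for this example is brief enough to write out:

```python
# Probability of five heads in a row from a fair coin -- the p-value here:
p_value = 0.5 ** 5
print(p_value)         # 0.03125
print(p_value < 0.05)  # True: small enough to reject the null hypothesis
```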

Say a trustworthy friend chooses randomly from a bag containing one normal coin and two double-headed coins, and then proceeds to flip the chosen coin five times and tell you the results. When would you be confident that you know which coin your friend chose? If a tails is flipped, then you know for sure it isn’t a coin with two heads, of course. But what if it comes up heads several times in a row? When would you say that you’re confident it’s a coin with two heads?

If you stick to hypothesis testing, this is the same question, and the answer is the same: reject the null hypothesis after five heads.

Notice that when you’re flipping a coin you think is probably fair, five flips seems too soon to question the coin. But when you know already that it’s twice as likely that you’re flipping a coin that comes up heads every time, five flips seems like a long time to wait before making a judgement. The non-Bayesian approach somehow ignores what we know about the situation and just gives you a yes or no answer about trusting the null hypothesis, based on a fairly arbitrary cutoff.

The Bayesian approach to such a question starts from what we think we know about the situation. This is called a “prior” or “prior distribution”. In the case of the coins, we understand that there’s a \frac{1}{3} chance we have a normal coin, and a \frac{2}{3} chance it’s a two-headed coin.

The Bayesian next takes into account the data observed and updates the prior beliefs to form a “posterior” distribution that reports probabilities in light of the data. The updating is done via Bayes’ rule, hence the name. In Gelman’s notation, this is:

\displaystyle p(\theta|y) = \frac{p(\theta)p(y|\theta )}{p(y)}

For our example, this is: “the probability that the coin is fair, given we’ve seen some heads, is what we thought the probability of the coin being fair was (the prior) times the probability of seeing those heads if the coin actually is fair, divided by the probability of seeing the heads at all (whether the coin is fair or not)”. This is true.

So say our friend has announced just one flip, which came up heads. Back with the “classical” technique, the probability of that happening if the coin is fair is 50%, so we have no idea if this coin is the fair coin or not. With Bayes’ rule, we get the probability that the coin is fair is \frac{\frac{1}{3} \cdot \frac{1}{2}}{\frac{5}{6}}. (Conveniently, that p(y) in the denominator there, which is often difficult to calculate or otherwise know, can often be ignored since any probability that we calculate this way will have that same denominator.) In our case here, the answer reduces to just \frac{1}{5} or 20%. There’s an 80% chance after seeing just one heads that the coin is a two-headed coin. After four heads in a row, there’s 3% chance that we’re dealing with the normal coin. Notice that even with just four flips we already have better numbers than with the alternative approach and five heads in a row. And the Bayesian approach is much more sensible in its interpretation: it gives us a probability that the coin is the fair coin. With the earlier approach, the probability we got was a probability of seeing such results if the coin is a fair coin – quite different and harder to reason about.
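A short sketch of that update, assuming the one-fair, two-double-headed bag from above (the function name is mine):

```python
from fractions import Fraction

def pr_fair_given_heads(k):
    """Posterior probability that the coin is fair after k heads in a row,
    with a prior of 1/3 fair and 2/3 two-headed."""
    prior_fair = Fraction(1, 3)
    prior_two_headed = Fraction(2, 3)
    lik_fair = Fraction(1, 2) ** k  # chance of k heads from a fair coin
    lik_two_headed = Fraction(1)    # a two-headed coin always shows heads
    evidence = prior_fair * lik_fair + prior_two_headed * lik_two_headed
    return prior_fair * lik_fair / evidence

print(pr_fair_given_heads(1))  # 1/5
print(pr_fair_given_heads(4))  # 1/33, about 3%
```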

It’s tempting at this point to say that non-Bayesian statistics is statistics that doesn’t understand the Monty Hall problem. But of course this example is contrived, and in general hypothesis testing does make it possible to compute a result quickly, with some mathematical sophistication producing elegant structures that can simplify problems – and one is generally only concerned with the null hypothesis anyway, so there’s in some sense only one thing to check. The Bayesian formulation is more concerned with all possible permutations of things, and it can be more difficult to calculate results, as I understand it – especially difficult to come up with closed forms for things. There again, the generality of Bayes does make it easier to extend it to arbitrary problems without introducing a lot of new theory.

The example with the coins is discrete and simple enough that we can actually just list every possibility. In general this is not possible, of course, but here it could be helpful to see and understand that the results we get from Bayes’ rule are correct, verified diagrammatically:

diagram of coin scenarios

Here tails are in grey, heads are in black, and paths of all heads are in bold. You can see, for example, that of the five ways to get heads on the first flip, four of them are with double-heads coins.

I’m thinking about Bayesian statistics as I’m reading the newly released third edition of Gelman et al.’s Bayesian Data Analysis, which is perhaps the most beautiful and brilliant book I’ve seen in quite some time. The example here is logically similar to the first example in section 1.4, but that one becomes a real-world application in a way that is interesting and adds detail that could distract from what’s going on – I’m sure it complements nicely the traditional abstract coin-flipping probability example here. I’ll also note that I may have over-simplified the hypothesis testing side of things, especially since the coin-flipping example has no clear idea of what is more extreme (all tails is as unlikely as all heads, etc.), there was no experiment design or reasoning about that side of things, and so on. I think the characterization is largely correct in outline, and I welcome all comments!

Polya’s How to solve it: Quotes and comments

The first rule of style is to have something to say.

solveit

I wish I had read How to Solve It when I was a fairly young student of mathematics. I wish also that I had read it when I was becoming a teacher of mathematics. It has been recommended many times, and now I will recommend it also. Really first-rate stuff. Everyone’s excited these days about Thinking, Fast and Slow, but if you’re moving into real problem-solving that necessitates “slow” mental work, it’s Polya who really has something to say. His comments extend beyond the core problem-solving theses as well. I include here some quotes of particular note.

Mathematics is interesting in so far as it occupies our reasoning and inventive powers.

 

Can our knowledge in mathematics be based on formal proofs alone? … It is certain that your knowledge, or my knowledge, or your students’ knowledge in mathematics is not based on formal proofs alone. If there is any solid knowledge at all, it has a broad experimental basis, and this basis is broadened by each problem whose result is successfully tested.

 

Definitions in dictionaries are not very much different from mathematical definitions in the outward form but they are written in a different spirit.

The writer of a dictionary is concerned with the current meaning of the words. He accepts, of course, the current meaning and states it as neatly as he can in form of a definition.

The mathematician is not concerned with the current meaning of his technical terms, at least not primarily concerned with that. What “circle” or “parabola” or other technical terms of this kind may or may not denote in ordinary speech matters little to him. The mathematical definition creates the mathematical meaning.

 

Teaching to solve problems is education of the will.

 

Another “problem to prove” is to “prove the theorem of Pythagoras.” We do not say: “Prove or disprove the theorem of Pythagoras.” It would be better in some respects to include in the statement of the problem the possibility of disproving, but we may neglect it, because we know that the chances for disproving the theorem of Pythagoras are rather slight.

 

If the student failed to get acquainted with this or that particular geometric fact, he did not miss so much; he may have little use for such facts in later life. But if he failed to get acquainted with geometric proofs, he missed the best and simplest examples of true evidence and he missed the best opportunity to acquire the idea of strict reasoning. Without this idea, he lacks a true standard with which to compare alleged evidence of all sorts aimed at him in modern life.

 

[Not all mathematical theorems can be split naturally into hypothesis and conclusion. Thus, it is scarcely possible to split so the theorem: “There are an infinity of prime numbers.”]

That last is one place where I may disagree with Polya. The split is easy and natural: “Assuming our usual definitions (hypothesis) there are an infinity of prime numbers (conclusion).” It’s easy to forget that what we consider natural, as “the natural numbers” and so on, is just a made-up mathematical system when considered formally. Where mathematics comes from is a good exposition on this sort of thing, I think. And here, in a bit of a contrast, is perhaps my favorite quasi-philosophical quote from Polya’s book:

Do not believe anything but doubt only what is worth doubting.

Perplexity: what it is, and what yours is

[Test your perplexity!]

Perplexity is an interesting measure of how well you’re predicting something. It has a nice mathy definition; I’ll try to describe it quite simply here.

Say the real thing you want to predict is the sequence of numbers from one to six. If you predict each number in turn with a six-sided die, you will be right about one sixth of the time. The die’s perplexity with this data (and really with any data it could possibly predict – dice are dumb) is six.

The perplexity of whatever you’re evaluating, on the data you’re evaluating it on, sort of tells you “this thing is right about as often as an x-sided die would be.”

Note that a human, after seeing that the first number is 1 and the second is 2, might guess that the third is 3, and so on. So a human is a better predictor than a die, and would have a lower perplexity. So would a linear model trained on the data seen so far, in this case. That is, whatever you’re evaluating (a model, or whatever) is allowed to see the data so far before it predicts what the next thing is, and that is frequently helpful.
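One common way to make this concrete: perplexity is the exponential of the average negative log-probability the predictor assigned to the things that actually occurred. A minimal sketch:

```python
import math

def perplexity(probs):
    """Perplexity from the probabilities a predictor assigned to the
    outcomes that actually happened."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A die assigns probability 1/6 to each outcome, right or wrong:
print(round(perplexity([1/6] * 6), 9))   # 6.0

# A predictor that puts probability 0.5 on each correct next item
# is as good as a two-sided die (a coin):
print(round(perplexity([0.5] * 10), 9))  # 2.0
```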

This is interesting and easily calculable for predicting categorical things like letters, or words. Maybe you’re a computational linguist and you want to measure how well your model can predict the next English word in a book. Humans are pretty good at this, by the way. (“The early bird gets the ___?”) Computers can achieve a perplexity of 247, apparently – like having a 247-sided die predict each subsequent word. (The computer “makes a new die” for each prediction, based on the words so far.)

You can also predict one letter at a time, and you’ll get a much better perplexity score, if only because there are many fewer distinct letters than distinct words. This has been done, as described in this interesting article about Andrei Kolmogorov:

To measure the artistic merit of texts, Kolmogorov also employed a letter-guessing method to evaluate the entropy of natural language. … His group conducted a series of experiments, showing volunteers a fragment of Russian prose or poetry and asking them to guess the next letter, then the next, and so on.

I know you’re itching to participate in such an experiment – and you can! On the linked page you’ll be able to try your hand. It’s interesting to consider whether you’re evaluating the artistic merit of the text, or your own knowledge of English, or some combination, or what.

Computers can predict letters pretty well – a perplexity of about 3.4. And that’s predicting from the range of all printable characters (well, ASCII characters). This human test only asks you to predict from a range of about 30 characters, which is a bit easier. So I hope you can beat 3.4!

[Test your perplexity!]

Thanks to David for originally leading me to think about perplexity!

e is the best base

Here’s a fun little problem: What list of positive integers that sum to 25 will give you the biggest product when multiplied together? You can think of it as breaking 25 into pieces (additively) and then multiplying the pieces – and it’s that product that you want to be large. So the list of twenty-five ones is pretty bad, because 1^{25}=1. The list of ten and fifteen is better, because 10 \cdot 15 = 150. The list of five fives is better yet, because 5^5 = 3125. Can you do better?

(Think about it for a minute if you want.)

Okay, the answer is this: 2^2 \cdot 3^7 = 8748. It’s the list of two twos (or one four, it doesn’t matter) and seven threes.
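A short memoized search (names are mine) confirms that this is the best possible product:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def best_product(n):
    """Largest product over all ways to write n as a sum of positive integers
    (taking n itself, unsplit, is allowed)."""
    if n == 0:
        return 1
    return max(k * best_product(n - k) for k in range(1, n + 1))

print(best_product(25))  # 8748 = 2^2 * 3^7
print(best_product(10))  # 36 = 3 * 3 * 4
```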

I noticed all these threes and remembered something I once heard in a computer science course, about how three is really the optimal base in terms of balancing number of unique digit symbols and number of digits needed to store a given number. (Binary, base two, has “too few” symbols – just 0 and 1 – which means lots of digits needed to store a big number, while decimal, base ten, has “too many” symbols – ten of them – but correspondingly uses fairly few digits to represent even largish numbers.) Of course it wasn’t really three that was best, it was e (wiki), which is around 2.71. And of course computers use binary anyway, so it was kind of beside the point. But I remembered this statement, which I never saw an explanation for or any other mention of.

Now I offer that the same sort of thing is going on here, and that you will be able to correctly answer any question like the above (not just for 25) by using as many threes as possible and then being smart with what’s left over. And it’s because 3 is close to e!

If we relax the rule about whole numbers and agree that all the pieces should be the same size (convince yourself of this), then for any starting number N (the 25 from the original problem) we want to find the number of pieces x' that maximizes \left(\frac{N}{x'}\right)^{x'}, or equivalently (with x = \frac{N}{x'}, the size of each piece) the x that maximizes x^{\left(\frac{N}{x}\right)}. Using the standard method for maximizing, the N quickly drops out and soon we have shown that x = e. (Thanks Joe!)
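Spelling out that maximization: with pieces of size x the product is f(x) = x^{N/x}, and taking logarithms,

\displaystyle \log f(x) = \frac{N}{x}\log x, \qquad \frac{d}{dx}\log f(x) = \frac{N\left(1 - \log x\right)}{x^2} = 0 \quad\Longrightarrow\quad \log x = 1 \quad\Longrightarrow\quad x = e

so the maximizing piece size is e regardless of N.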

The way I first investigated this was by choosing 100 and using R to try a bunch of numbers and plot the results. The products get large and I don’t care much about their absolute values, and the curvature changes if you use numbers other than 100 to start from, but the graph always looks basically like this (code):

Rplot

So now we have both real math (calculus) and computational-experiment demonstrations of yet another way that e is cool! I don’t know if it’s worth adding to this list, but I like it.