# How To Sort By Average Rating

Evan Miller’s well-known How Not To Sort By Average Rating points out problems with ranking by “wrong solution #1” (by differences, upvotes minus downvotes) and “wrong solution #2” (by average ratings, upvotes divided by total votes). Miller’s “correct solution” is to use the lower bound of a Wilson score confidence interval for a Bernoulli parameter. I think it would probably be better to use Laplace smoothing, because:

• Laplace smoothing is much easier
• Laplace smoothing is not always negatively biased

This is the Wilson scoring formula given in Miller’s post, which we’ll use to get 95% confidence interval lower bounds:

(Use minus where it says plus/minus to calculate the lower bound.) Here is the observed fraction of positive ratings, zα/2 is the (1-α/2) quantile of the standard normal distribution, and n is the total number of ratings.

Now here’s the formula for doing Laplace smoothing instead:

(upvotes + $\alpha$) / (total votes + $\beta$)

Here $\alpha$ and $\beta$ are parameters that represent our estimation of what rating is probably appropriate if we know nothing else (cf. Bayesian prior). For example, $\alpha = 1$ and $\beta = 2$ means that a post with no votes gets treated as a 0.5.

The Laplace smoothing method is much simpler to calculate – there’s no need for statistical libraries, or even square roots!

Does it successfully solve the problems of “wrong solution #1” and “wrong solution #2”? First, the problem with “wrong solution #1”, which we might summarize as “the problem with large sample sizes”:

 upvotes downvotes wrong #1 wrong #2 Wilson Laplace first item 209 50 159 0.81 0.7545 0.80 second item 118 25 93 0.83 0.7546 0.82

All the methods agree except for “wrong solution #1” that the second item should rank higher.

Then there’s the problem with “wrong solution #2”, which we might summarize as “the problem with small sample sizes”:

 upvotes downvotes wrong #1 wrong #2 Wilson Laplace first item 1 0 1 1.0 0.21 0.67 second item 534 46 488 0.92 0.90 0.92

All the methods agree except for “wrong solution #2” that the second item should rank higher.

How similar are the results for the Wilson method and the Laplace method overall? Take a look: here color encodes the score, with blue at 0.0, white at 0.5, and red at 1.0:

They’re so similar, you might say, that you would need a very good reason to justify the complexity of the calculation for the Wilson bound. But also, the differences favor the Laplace method! The Wilson method, because it’s a lower bound, is negatively biased everywhere. It’s certainly not symmetrical. Let’s zoom in:

With the Wilson method, you could have three upvotes, no downvotes and still rank lower than an item that is disliked by 50% of people over the long run. That seems strange.

The Laplace method does have its own biases. By choosing $\alpha=1$ and $\beta=2$, the bias is toward 0.5, which I think is reasonable for a ranking problem like this. But you could change it: $\alpha=0$ with $\beta=1$ biases toward zero, $\alpha=1$ with $\beta=0$ biases toward one. And $\alpha=100$ with $\beta=200$ biases toward 0.5 very strongly. With the Wilson method you can tune the size of the interval, adjusting the confidence level, but this only adjusts how strongly you’re biased toward zero.

Here’s another way of looking at the comparison. How do the two methods compare for varying numbers of upvotes with a constant number (10) of downvotes?

Those are similar curves. Not identical – but is there a difference to justify the complexity of the Wilson score?

In conclusion: Just adding a little bit to your numerators and denominators (Laplace smoothing) gives you a scoring system that is as good or better than calculating Wilson scores.

# Bayes’ Rule for Ducks

You look at a thing.

Is it a duck?

Re-phrase: What is the probability that it’s a duck, if it looks like that?

Bayes’ rule says that the probability of it being a duck, if it looks like that, is the same as the probability of any old thing being a duck, times the probability of a duck looking like that, divided by the probability of a thing looking like that.

$\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}$

This makes sense:

• If ducks are mythical beasts, then $Pr(duck)$ (our “prior” on ducks) is very low, and the thing would have to be very duck-like before we’d believe it’s a duck. On the other hand, if we’re at some sort of duck farm, then $Pr(duck)$ is high and anything that looks even a little like a duck is probably a duck.
• If it’s very likely that a duck would look like that ($Pr(looks|duck)$ is high) then we’re more likely to think it’s a duck. This is the “likelihood” of a duck looking like that thing. In practice it’s based on how the ducks we’ve seen before have looked.
• The denominator $Pr(looks)$ normalizes things. After all, we’re in some sense portioning out the probabilities of this thing being whatever it could be. If 1% of things look like this, and 1% of things look like this and are ducks, then 100% of things that look like this are ducks. So $Pr(looks)$ is what we’re working with; it’s the denominator.

Here’s an example of a strange world to test this in:

There are ten things. Six of them are ducks. Five of them look like ducks. Four of them both look like ducks and are ducks. One thing looks like a duck but is not a duck. Maybe it’s a fake duck? Two ducks do not look like ducks. Ducks in camouflage. Test the equality of the two sides of Bayes’ rule:

$\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}$

$\displaystyle \frac{4}{5} = \frac{\frac{6}{10} \cdot \frac{4}{6}}{\frac{5}{10}}$

It’s true here, and it’s not hard to show that it must be true, using two ways of expressing the probability of being a duck and looking like a duck. We have both of these:

$\displaystyle Pr(duck \cap looks) = Pr(duck|looks) \cdot Pr(looks)$

$\displaystyle Pr(duck \cap looks) = Pr(looks|duck) \cdot Pr(duck)$

Check those with the example as well, if you like. Using the equality, we get:

$\displaystyle Pr(duck|looks) \cdot Pr(looks) = Pr(looks|duck) \cdot Pr(duck)$

Then dividing by $Pr(looks)$ we have Bayes’ rule, as above.

$\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}$

This is not a difficult proof at all, but for many people the result feels very unintuitive. I’ve tried to explain it once before in the context of statistical claims. Of course there’s a wikipedia page and many other resources. I wanted to try to do it with a unifying simple example that makes the equations easy to parse, and this is what I’ve come up with.

# Some theory and practice for data cleaning

Data cleaning can refer to many things. I’ll mention some important data structure ideas elucidated by Wickham, and then spend more time on the topic of assumptions and data coherence, and how data cleaning happens in practice. By the end, we may achieve some hope of getting reasonably meaningful results from our data.

Hadley Wickham has made available very good explanations (paper, slides, presentation) of what he calls “tidy data”. Tidy data has variables stored in columns, observations in rows, and a single type of experimental unit per dataset. This framework is both a good idea in its own right, and it also let’s you easily use a lot of very good data manipulation and graphics tooling in R – much of which is also due to Wickham.

The idea of tidy data provides guidance in how to begin restructuring data for analysis. What, then, are the data problems that are not fundamentally due to data structure?

Harry Frankfurt has a charming essay, published as a small book, called On Bullshit.

The book is quite funny, and it also provides a useful definition of bullshit: communicating without concern for the truth, or with “indifference to how things really are” (p. 34). It claims, and may be correct, that “bullshit is a greater enemy of the truth than lies are” (p. 61).

A lot of data has this bullshit lack of regard for facts, and it is this sort of data problem that I am principally concerned with. This class of problem can arise all throughout the data quality framework of McCallum, which calls for data to be complete, coherent, correct, and accountable, though typically coherence admits most easily of self-contained testing.

The vast majority of data problems are not intentionally introduced. They appear by accident or through miscommunication, and because they were not detected and corrected. It isn’t easy to completely assure the quality of a data set, and in any event there frequently aren’t resources allocated to thorough checking. Data can seem quite authoritative when viewed from a distance – why check it?

It would be nice to have at least two individuals (or teams) develop the same data product in parallel, collaborating on a shared set of business rules, so that they can check their final results against one another, and have a further QA step before release. But the model of a lone individual hacking out a spreadsheet without so much as a second look before release is unfortunately common. When it comes to open data initiatives, for example, it sometimes seems that data is scooped from a trough and thrown into the world.

Data dirtiness could be addressed at creation, and we should keep this in mind when creating data ourselves. Keep it clean! But as it is often not, it is incumbent on the analyst to be diligent in checking and addressing issues with the data under analysis. Be aware of the process you are a part of.

All of analysis is coming up with beliefs. These beliefs are based on implicit models. Probably the most common cause of initial results from data is error. This is especially true for interesting results, but it applies everywhere. For example:

$wc -l data.csv 5 data.csv A common reading of this would be that the file data.csv contains five lines, or rows. However, this is not necessarily the case: > read.table("data.csv") V1 1 I think...\nthat...\nit's good! 2 dunno 3 It's bad! There are three rows here. What happened? There was a problem with our assumptions, or with the implicit model we used. The program wc does not tell us how many lines there are in a file. It tells us how many newline characters (“\n”) there are – how many times there is a byte that goes 00001010. The implicit assumption was that each line of data had exactly one newline character. It was a linear model: “number of lines in the file = 1 * number of newline characters in the file + 0”. Notice that the model is fine, and the result is also fine – if the assumption holds. But often, we aren’t even aware of the assumption, we aren’t aware of the implicit model, and we don’t check what we’re relying on. The example with wc is about being aware of levels of abstraction and how they interact with your tools so as to not draw incorrect conclusions. The sorts of assumptions more closely related to what we usually think of as clean data issues are things like: • this is everything we should include • there’s exactly one record per entity • the line items add up to the subtotals • these are all in the same units • these dates are all in order A good place to start with a new data set is by checking for problems that could not possibly occur. Very often, you will find that they occur anyway. A common example is checking unique IDs. They are frequently not. When it comes to checking things, I recommend never checking anything once. When you become aware of something that should be true about your data, write a check that embeds itself forever in your code. Very occasionally there are performance concerns, but almost always correctness is more important. The “theory” here is that the strength of the computer – automating things – should be leveraged to reduce human cognitive load while increasing confidence in results. Include explicit assumption tests in analysis code. Checks can also be included for things that need not be true – for intermediate results, totals, and so on. Analysis code may have many steps, and it may not be obvious how a change at an early step affects things downstream. Including something as simple as, for example, stopifnot(sum(data$total)==9713) at the bottom of a script will alert you if you introduce something that changes this – especially useful to know when you think you’re making changes that don’t.

Another way to make this point about coded checks is that comments in code should be avoided for describing data.

# There are 49 missing values.     # NO
stopifnot(sum(is.na(my.data))==49) # YES

The comment is instantly out of date if something changes, and likely forgotten. The code will automatically let you know if something changes. This is vastly superior.

Quite a lot of checks can be written very simply – for example, the often-neglected checks around merges (joins). Many more complex checks are included in the assertive package available for R, which has things like assert_all_are_credit_card_numbers and assert_all_are_isbn_codes, for example.

Many checks will be domain-specific, but it’s common to have univariate distributions to investigate. The tails (outliers) are often the result of some data problem, and we should be grateful when this is the case, because it’s much easier to notice than if the spurious data is throughout the range of correct values. Of course the tails are also important places to take care because they will have dramatic effects on many types of analysis.

This is a place where data visualization is an essential tool for data cleaning. Histograms can show some information about a distribution. I often prefer what Cleveland calls quantile plots (different from QQ plots) which can be easily made in R (with index rather than quantile labels) as plot(sort(my.data)). These plots do not depend on bin size – every data point is shown. Much more of the fine structure of a distribution can be quickly and easily taken in. This image shows a contrived data set two ways, illustrating the occasional advantage of quantile plots.

Many data checks might be called “reality checks”. It is worth being aware of the danger of crossing over into “unreality checks” when our biases overrule the data. There is often an essentially Bayesian flavor to checking data. We suspect that a janitor’s salary of \$100,000,000 is not correct because of our prior beliefs about the world. Probably we are correct to say that this data is bad. But if we are over-zealous in eliminating or adjusting data that seems wrong, we may end up eliminating or adjusting the most interesting correct data as well. Care must be taken to make choices that maximize the information content of our data.

When possible, fix data. For example, adult heights entered sometimes in feet and sometimes in centimeters can be easily adjusted to either by machine because their ranges do not overlap. This is probably better than dropping all of either type, but choices like this may not always be easy.

It’s not uncommon to hear it said that 80% of a data project is data wrangling work. But this doesn’t mean that you can do data cleaning Monday through Thursday and analysis on Friday. It is a good idea to think about data cleaning up front, but there are often practical concerns about the scope of the data cleaning work.

> ncol(my.data)
[1] 1274

In addition, it’s often the case that you discover data problems in the course of analysis. (Sometimes, this is the only way that data problems are discovered.) You likely can’t predict at the outset exactly which fields you’ll need or how problems will arise. A simple example is that two columns may each look fine, but if analysis calls for their ratio, absurd data problems can become evident.

Another concern is that data issues are often stubbornly non-general. We would like to aim to generalize, but may find that ad hoc solutions are sometimes unavoidable. We are left aiming to generalize, preparing to specialize.

Probably data cleaning will remain art and science, entangled with analysis, and resistant to fully generalizable principles. As usual:

The difference between theory and practice is larger in practice than it is in theory.

# Use counts, not percentages

Consider this data:

total    part   percent
765      47        6%


Clearly, there is some redundancy. Both part and percent express the same thing.

With infinite precision, you could use either part or percent at your pleasure. However, in the common case where the counts (total and part) are integers and the percentage(s) are not, computers will store the integers generally much more nicely and compactly than nasty decimal things (floats or string representations).

Percentages also commonly get rounded off, in which case information is lost. In the above example, 6% of 765 could be anything from 43 to 49, and possibly even more depending on what precision is used for the calculation.

The moral of the story is that for data, you should always use counts, not percentages.

# Building Predictive Models for NYC High Schools

I don’t have anything to say about building predictive models today myself, but I did want to share this paper from Alec Hubel, which demonstrates some of the really interesting things that you can do even with just the publicly available education data out there on the net.

Common warnings about correlation vs. causation in education definitely apply, but I think this provides a really neat look at some of the data that NYC makes available. I like it even just for the second figure, which is a simple but rarely seen visualization that makes it easy to explore, by looking, some of the patterns across districts.

Thanks Alec!

# NYC standardized test results: Changes in average scores by year – where is the Common Core shake-up?

New York City had changed Math and ELA tests in 2013, aligning them to the Common Core. This was billed as representing a big shift, testing deeper concepts and so on. We’ve seen that the distribution of scores shifted down dramatically in 2013, and results aren’t being reported for District 75 schools any more. Also percent proficient is down a lot, but that’s like saying more people are now shorter than a stick you raised in the air, so I’m not paying it any mind. The overall position of the distribution of scores is also pretty much arbitrary. It would be interesting if we saw changes in school grade position in the distribution change more from 2012 to 2013 than between other years, indicating that the tests changed more from 2012 to 2013 than between other years. Do we see this?

Figure 12a. Changes in average Math and ELA test scores for same grade year to year and same cohort grade to grade, 2006-2013

Nope. What if we use this other technique to give our eyes some help?

Figure 12b. Density of changes in average Math and ELA test scores for same grade year to year and same cohort grade to grade, 2006-2013

Still nope. If anything, the 2013 results resembled the 2012 results more than other years have. Any way you look at it, the 2013 tests don’t seem to have shuffled NYC schools any more than other years’ tests. Of course, this would be more meaningful if there was in general less shuffling with every new test administration. I’m still not over how little stability there is in the school average scores.