Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains

I was recently pointed to the 2010 USDOE/Mathematica paper that shares this post’s name. One has to think that Rockoff and Kane et al. have seen it, but nobody seems to ever mention it. From the abstract:

Simulation results suggest that value-added estimates are likely to be noisy using the amount of data that are typically used in practice. Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data.

Those numbers are awful. What’s interesting to me is that they aren’t even that much better when using three years of data, as compared to one year. I had thought they would probably improve a lot with triple the data. They just stay bad.

This study used a fairly simple VAM, similar to what they do in Tennessee, so that's one possible critique. But the fact is, this is the only research I've seen that seriously attempts to address the trustworthiness of VAM at the teacher and school level. Everybody else seems to be ignoring it, as if there were no cost to making arbitrary, incorrect judgements about teachers' work. This paper is worth a look.

Human-Computer Cooperation for Data Merging

This problem comes up all the time, but the instance that got me thinking about it most recently was this: The NYC MTA provides subway station geo-coordinates like this:

127,,Times Sq - 42 St,,40.75529,-73.987495,,,1,
624,,103 St,,40.7906,-73.947478,,,1,
119,,103 St,,40.799446,-73.968379,,,1,

But their subway turnstile data is linked with a table like this:

R180	R252	103 ST		6		IRT
R191	R170	103 ST		1		IRT

Notice a couple of things. There's only one subway stop called 'Times Square', but its name differs between the two files. That's one thing, but then there are actually three different stations called '103 Street'. (Only two are shown here.) To tell them apart you have to look at subway lines (1, 6, etc.) in one file, and at longitudes in the other. Even identifying the contenders to choose between is a huge pain, and certainly no generic matching algorithm will do it correctly for you. It's such a specialized case that it probably isn't worth developing an algorithm just for this data – or rather, an 'algorithm' that solved this problem would be essentially identical to just doing the matching by hand anyway.

This kind of merge difficulty comes up all the time while getting data munged into a usable state. In this example, I just want to attach the turnstile data to the geo data. Most of the matches aren't terribly hard, but they're hard enough that they won't happen automatically, and it would be a super big hassle to do them by hand, grepping or control-f'ing around the files to build a merge table. More than once I've wished there were a tool that would help me do this kind of thing quickly.
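To see why a generic matcher can't finish this job, here's a sketch using Python's difflib on names like the ones above (the "96 St" entry is invented for contrast). The Times Square pair resolves easily, but the two identical '103 St' geo entries tie exactly, and nothing in the strings themselves can break the tie:

```python
from difflib import get_close_matches

# Station names modeled on the two MTA files above;
# "96 St" is a made-up extra station added for contrast.
geo_names = ["Times Sq - 42 St", "103 St", "103 St", "96 St"]
turnstile_names = ["TIMES SQ-42 ST", "103 ST"]

results = {}
for name in turnstile_names:
    # Compare case-insensitively; get_close_matches ranks candidates
    # by a Levenshtein-style similarity ratio
    results[name] = get_close_matches(
        name.lower(), [g.lower() for g in geo_names], n=3, cutoff=0.5
    )

print(results)
# 'TIMES SQ-42 ST' finds its geo twin; '103 ST' comes back with two
# indistinguishable candidates, which only lines/coordinates could separate.
```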

One way to improve the situation going forward is to have everyone in the world implement unique IDs and/or controlled vocabularies for absolutely everything, and enforce their usage through input validation and strong checking of everything at every point in data processes. That would be nice. I think some messy data situations may persist.

There are a couple special cases of this matching problem that have received a lot of attention:

Geocoding is essentially messy matching on strings that represent addresses, in order to figure out a unique match with associated geo-coordinates. Geocoders can be found in various GIS software packages, and New York City makes such a tool available specifically for NYC addresses.

Record linkage in health care and education seeks to match up human records using name, date of birth, and so on. My colleagues at the DOE and I worked (and they continue to work) on the education case, for example matching college outcome data to DOE internal student ID numbers that have associated test scores. The Link King is a fairly complete human-matching tool born in health care but possibly more broadly applicable. (It is itself free but runs on top of SAS, which isn't.)

I don’t particularly care about address matching or human matching, for two reasons. First, they won’t help me with the subway stations. I want a general-purpose tool. Domain-specific rules are certainly useful in specific domains, but I don’t like them much in general. Second, in both cases the tools achieve some match percentage which I don’t find acceptable. I want to achieve 100% matched data. I want a tool that will make it easy for me to personally review any case that isn’t 100% certain. (The Link King has some functionality for this, but it’s super specific to human-identifying data. Also, it runs on SAS, which is gross.)

There are more general tools that compare strings for similarity. They all seem to be based on Soundex, Levenshtein edit distance, or something similar. Python has difflib, there's string_score for JavaScript, etc. These are a good start, I think. SeatGeek has their FuzzyWuzzy, which extends basic string comparisons to work better for common cases while still remaining fairly general.
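As a minimal sketch (the candidate words are made up), difflib's SequenceMatcher already gives the kind of ranking such a tool needs:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Edit-distance-flavored ratio in [0, 1], case-folded first
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Rank candidates from one list against a value from the other
candidates = ["duckies", "Cow", "goose"]
ranked = sorted(candidates, key=lambda c: similarity("Ducks", c), reverse=True)
print(ranked)  # "duckies" ranks first, despite not being an exact match
```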

I want to be super OCD with these merges though – I don’t want to just hope that the computer is doing a good job matching. So what I really want is not just the computer to find a likely match, but to let me confirm or correct the matches as it goes. I want not just an algorithm but an interface. In the case of the 103rd St subway stop, it should show me that there are three good possible matches and let me work it out – and it should make the whole operation as frictionless as possible.

I’m imagining this as a JavaScript/web app, probably running entirely client-side. You give it two lists of values that you’d like to be able to merge on. For example,

mergevals1
cow
Ducks

mergevals2
duckies
Cow

For each value in the first list, the interface shows you a list of options from the second list, ordered by apparent likelihood of being a match, based on string comparisons of some kind. So you would quickly click on “Cow” to match with “cow”, perhaps agree more slowly that “duckies” is a match for “Ducks”, and perhaps specify new common names for both matches (it could also just use the first list’s version by default). The interface would then produce these:

mergevals1	common
cow		cow
Ducks		duck


mergevals2	common
duckies		duck
Cow		cow

Then you can merge common onto your first data set, and also onto your second data set, and then you can merge both your data sets by common.
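That final step might look like this sketch in Python, with plain dicts standing in for the two data sets (the count and weight columns are invented for illustration):

```python
# Lookup tables the interface would emit (from the example above)
common1 = {"cow": "cow", "Ducks": "duck"}     # mergevals1 -> common
common2 = {"duckies": "duck", "Cow": "cow"}   # mergevals2 -> common

# Hypothetical data sets, each keyed by its own file's spelling
data1 = {"cow": {"count": 3}, "Ducks": {"count": 7}}
data2 = {"duckies": {"weight": 2.0}, "Cow": {"weight": 450.0}}

# Re-key both data sets by the common name, then join on it
by_common1 = {common1[k]: row for k, row in data1.items()}
by_common2 = {common2[k]: row for k, row in data2.items()}
merged = {
    k: {**by_common1[k], **by_common2[k]}
    for k in by_common1.keys() & by_common2.keys()
}
print(merged)
```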

(In the subway case, you would pass in a combined column with station name and lines on the one side, and station name and geo-coordinates on the other.)

The process would still have a lot of human review, but the annoying parts are sped up by the computer so that the human user is just doing the decision-making. I think this would make a lot of important and useful data merges way more feasible and therefore lead to more cool results.

Maybe I’ll make this.

Value-Added Measures Unstable: Responses from Rockoff and Kane

Columbia’s Center for Public Research and Leadership hosted a panel last night entitled “Evaluating K-12 Teachers: What’s Left to Debate in the Wake of the MET and Chetty-Friedman-Rockoff Studies?”

The panelists were Thomas Kane of the Gates Foundation’s Measures of Effective Teaching (MET) project and Jonah Rockoff of the “Chetty-Friedman-Rockoff study” (The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood). Panelists Rob Weil of the American Federation of Teachers and Shael Polakow-Suransky of the New York City Department of Education responded after the two researchers presented their executive summaries.

There were three questions from the audience after the panelists’ remarks, of which mine was the last. It ran thus: It’s clear that there is some information in teacher value-added results in the aggregate, and more or less clear correlations can be illustrated if teacher results are heavily binned and averaged before creating, for example, the most popular plots in both the MET and C-F-R reports. However, what matters for K-12 teacher evaluation is the individual teacher level results, so we should be interested in their reliability. A preliminary MET report (page 18, table 5, prior-year correlation) is the only place that gives any measure of this reliability for the MET study, despite (or perhaps because of) MET’s unquestioning use of VAM as a known-good model. The highest year-over-year correlation in teacher scores is for math, at 0.404. (I was generous in not mentioning that it’s less than half that for ELA.) If you square the math correlation you get about 0.16, so teacher value-added scores are essentially five parts noise for every one part signal. They change a lot from year to year – more than we can reasonably expect that teachers actually change. Professor Kane had used the metaphor of a bathroom scale for VAM. If your bathroom scale measures you one year at 240 pounds and the next year at 40 pounds, you start to question the usefulness of your bathroom scale.
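For the record, the arithmetic behind that signal-to-noise claim works out to roughly five to one:

```python
# Signal/noise arithmetic for the MET math correlation quoted above
r = 0.404                 # highest year-over-year correlation (math)
signal = r ** 2           # share of one year's variance explained by the other
noise = 1 - signal
print(round(signal, 2), round(noise / signal, 1))  # ~0.16 signal, ~5:1 noise-to-signal
```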

After all this setup, my actual question was in two parts. First, how much random noise is an acceptable level for high-stakes teacher evaluations? Second, are the new VAM calculations that the New York State Department of Education (NYSED) commissioned from the American Institute for Research (AIR) any more reliable than the measures of the MET project and other VAM, which have similarly poor year-over-year stability?

The panelists answered neither of my questions, but they did say some things in response.

Rockoff said that he’d thought a lot about the issue. Then he offered that batting averages are similarly unstable for players, year over year. He also suggested that sales figures for salespeople are probably unstable as well.

I don’t know much about baseball or sales, so I don’t know if his claims are true. Even if they are, I’m not particularly moved by an argument of the form “everything is bad, therefore let’s call it good”. Further, I think there’s a big difference between VAM and a statistic like batting average (hits over at-bats – I looked it up) or sales (just add them, I imagine) – these are simple calculations that everyone can follow, and moreover they are quite directly linked to what the person in question actually does. They’re also pretty hard to fake. And importantly, if somebody else has a high or low batting average, it doesn’t affect your batting average. You can still get a high or low batting average regardless of other players. Value-added, on the other hand, is almost entirely a zero-sum game: a teacher gets a higher score because their students’ test scores went up more relative to other teachers’ students’ gains – it’s not on some absolute scale of learning. And I don’t know of any laws that evaluate baseball players based on just their batting average and nothing else. I have a hard time getting from Rockoff’s statements to a satisfying argument for VAM as a yearly teacher evaluation method.

Kane also indicated that he’d been anticipating such a question as mine. His response was that we shouldn’t be concerned with year-over-year correlation, we should be concerned with year-to-career stability, which is higher.

Well, of course it’s higher. That’s basically the Law of Large Numbers: if you average a bunch of high-noise measurements, each measurement will look more like the average than it looks like any other single measurement. This reality doesn’t make the individual observations any better.
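A quick simulation makes the point (the numbers are hypothetical, tuned so the year-over-year correlation lands near MET’s 0.404): each year’s score correlates much better with the three-year average it belongs to than with any other single year, partly for the mechanical reason that the year is one third of its own average.

```python
import random

random.seed(0)

def corr(xs, ys):
    # Pearson correlation, no external dependencies
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical teachers: a fixed true effect plus fresh noise every year.
# Noise SD 1.25 vs true-effect SD 1 gives a year-over-year correlation
# of about 1 / (1 + 1.25**2) ~= 0.39, close to MET's 0.404.
true_effects = [random.gauss(0, 1) for _ in range(5000)]

def one_year():
    return [t + random.gauss(0, 1.25) for t in true_effects]

y1, y2, y3 = one_year(), one_year(), one_year()
career = [(a + b + c) / 3 for a, b, c in zip(y1, y2, y3)]

c_year = corr(y1, y2)        # year-over-year: low
c_career = corr(y1, career)  # year-to-career: much higher, partly
                             # mechanically, since y1 is in the average
print(round(c_year, 2), round(c_career, 2))
```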

It’s not that I don’t like the Law of Large Numbers – certainly we can trust VAM more (relatively) when looking at three years of results, compared to just one year’s. And Professor Kane did kind of hint that it would be good for a principal to use three years of data before trying to make a tenure decision. I believe the C-F-R study also used more than one year to estimate teacher effectiveness. But the fact remains that the one-year point estimates are what people have been looking at, and will be looking at, when they’re evaluating teachers. I don’t think these numbers are good enough. It’s bad science. We can’t trust the measurement.

I didn’t get to ask my follow-up question, which was this: Douglas Harris, author of Value-Added Measures in Education (Harvard Education Press), advocated in a September 2012 article specifically against combining value-added measures with other measures into a composite measure of teacher effectiveness. He argued instead for using value-added as a screen, the way a doctor gives many people an inexpensive screening test and then follows up more carefully with the patients who are flagged, knowing that many of those patients may have had false positive results. Could we do something like this?