Columbia’s Center for Public Research and Leadership hosted a panel last night entitled “Evaluating K-12 Teachers: What’s Left to Debate in the Wake of the MET and Chetty-Friedman-Rockoff Studies?”

The panelists were Thomas Kane of the Gates Foundation’s Measures of Effective Teaching (MET) project and Jonah Rockoff of the “Chetty-Friedman-Rockoff study” (The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood). Panelists Rob Weil of the American Federation of Teachers and Shael Polakow-Suransky of the New York City Department of Education responded after the two researchers presented their executive summaries.

There were three questions from the audience after the panelists’ remarks, of which mine was the last. It ran thus: It’s clear that there is some information in teacher value-added results in the aggregate, and more or less clear correlations can be illustrated if teacher results are heavily binned and averaged before creating, for example, the most popular plots in both the MET and C-F-R reports. However, what matters for K-12 teacher evaluation is the individual teacher-level results, so we should be interested in their reliability.

A preliminary MET report (page 18, table 5, prior-year correlation) is the only place that gives any measure of this reliability for the MET study, despite (or perhaps because of) MET’s unquestioning use of VAM as a known-good model. The highest year-over-year correlation in teacher scores is for math, at 0.404. (I was generous in not mentioning that it’s less than half that for ELA.) If you square the math correlation, you can conclude that teacher value-added scores are essentially four parts noise for every one part signal. They change a lot from year to year – more than we can reasonably expect that teachers actually change.

Professor Kane had used the metaphor of a bathroom scale for VAM. If your bathroom scale measures you one year at 240 pounds and the next year at 40 pounds, you start to question the usefulness of your bathroom scale.
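For anyone who wants to check the arithmetic behind squaring the correlation, here is a minimal Python sketch. The only number taken from the report is the 0.404 correlation; the variable names are mine, and reading the squared correlation as the share of one year's score variance that is linearly predictable from the prior year's score is the usual r-squared interpretation, not anything specific to MET.

```python
# Year-over-year correlation for math, as quoted from the preliminary MET report.
r_math = 0.404

# Squared correlation: the share of variance in one year's scores that is
# linearly predictable from the previous year's scores.
shared = r_math ** 2
unshared = 1 - shared

print(f"shared variance:   {shared:.3f}")
print(f"unshared variance: {unshared:.3f}")
```

The shared share comes out to roughly 0.16, which is in the neighborhood of the one-part-signal characterization and, if anything, a little less flattering to VAM.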

After all this setup, my actual question was in two parts. First, how much random noise is an acceptable level for high-stakes teacher evaluations? Second, are the new VAM calculations that the New York State Education Department (NYSED) commissioned from the American Institutes for Research (AIR) any more reliable than the measures of the MET project and other VAMs, which have similarly poor year-over-year stability?

The panelists answered neither of my questions, but they did say some things in response.

Rockoff said that he’d thought a lot about the issue. Then he offered that batting averages are similarly unstable for players, year over year. He also suggested that sales figures for salespeople are probably unstable as well.

I don’t know much about baseball or sales, so I don’t know if his claims are true. Even if they are, I’m not particularly moved by an argument of the form “everything is bad, therefore let’s call it good”. Further, I think there’s a big difference between VAM and a statistic like batting average (hits over at-bats – I looked it up) or sales (just add them, I imagine). These are simple calculations that everyone can follow, and they are quite directly linked to what the person in question actually does. They’re also pretty hard to fake. And importantly, somebody else’s high or low batting average doesn’t affect yours. Value-added, on the other hand, is almost entirely a zero-sum game: a teacher has a higher score because their students’ test scores went up more than other teachers’ students’ scores did, not because of gains on some absolute scale of learning. And I don’t know of any laws that evaluate baseball players based on just their batting average and nothing else. I have a hard time getting from Rockoff’s statements to a satisfying argument for VAM as a yearly teacher evaluation method.

Kane also indicated that he’d been anticipating a question like mine. His response was that we shouldn’t be concerned with year-over-year correlation; we should be concerned with year-to-career stability, which is higher.

Well of course it’s higher. That’s basically the Law of Large Numbers. If you average a bunch of high-noise measurements, the measurements will look more like their average than they look like each other, on average. This reality doesn’t make the individual observations any better.
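A toy simulation makes the point. This is only a sketch under assumptions of my own: each teacher has a fixed "true" effect, each year's score is that effect plus independent noise, and the noise level (`noise_sd=1.2`) is picked so that the year-over-year correlation lands near 0.4. The "career" score here is simply the average over ten simulated years, including the first one. None of these numbers come from MET or C-F-R.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_teachers=5000, n_years=10, signal_sd=1.0, noise_sd=1.2, seed=0):
    """Return (year-to-year, year-to-career) correlations for made-up teachers."""
    rng = random.Random(seed)
    # Assumed model: a fixed true effect per teacher, plus fresh noise each year.
    true_effect = [rng.gauss(0, signal_sd) for _ in range(n_teachers)]
    scores = [
        [t + rng.gauss(0, noise_sd) for _ in range(n_years)]
        for t in true_effect
    ]
    year1 = [s[0] for s in scores]
    year2 = [s[1] for s in scores]
    career = [sum(s) / n_years for s in scores]
    return pearson(year1, year2), pearson(year1, career)


year_to_year, year_to_career = simulate()
print(f"year-to-year correlation:   {year_to_year:.2f}")
print(f"year-to-career correlation: {year_to_career:.2f}")
```

Under this model the theoretical values are about 0.41 for year-to-year and about 0.68 for year-to-career. The second number is higher even though every single-year score is exactly as noisy as before.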

It’s not that I don’t like the Law of Large Numbers – certainly we can trust VAM more (relatively) when looking at three years of results, compared to just one year’s results. And Professor Kane did kind of hint that it would be good for a principal to use three years of data before trying to make a tenure decision. I believe the C-F-R study also used more than one year to estimate teacher effectiveness. But the fact remains that the one-year point estimates are what people have been looking at and will be looking at when they’re evaluating teachers. I don’t think these numbers are good enough. It’s bad science. We can’t trust the measurement.

I didn’t get to ask my follow-up question, which was this: Douglas Harris, author of Value-Added Measures in Education (Harvard Education Press), argued in a September 2012 article specifically against combining value-added measures with other measures in a composite measure of teacher effectiveness. He argued for using value-added as a screen, in the same way that a doctor gives many people an inexpensive screening test and then follows up more carefully with the patients it flags, knowing that many of those patients may have had false positive results. Could we do something like this?
