NYC standardized test results: Changes in average scores by year – where is the Common Core shake-up?

New York City changed its Math and ELA tests in 2013, aligning them to the Common Core. This was billed as a big shift, testing deeper concepts and so on. We’ve seen that the distribution of scores shifted down dramatically in 2013, and results are no longer being reported for District 75 schools. Percent proficient is also down a lot, but that’s like saying more people are now shorter than a stick you’ve raised higher in the air, so I’m not paying it any mind. The overall position of the distribution of scores is also pretty much arbitrary. What would be interesting is if schools’ positions within the distribution changed more from 2012 to 2013 than between other pairs of years, which would indicate that the tests changed more then than they usually do. Do we see this?

Figure 12a. Changes in average Math and ELA test scores for same grade year to year and same cohort grade to grade, 2006-2013

Nope. What if we give our eyes some help by looking at densities of these changes instead?

Figure 12b. Density of changes in average Math and ELA test scores for same grade year to year and same cohort grade to grade, 2006-2013

Still nope. If anything, the 2013 results resemble the 2012 results more closely than consecutive years usually resemble each other. Any way you look at it, the 2013 tests don’t seem to have shuffled NYC schools any more than other years’ tests did. Of course, this would be more meaningful if there were, in general, less shuffling with every new test administration. I’m still not over how little stability there is in the school average scores.

[table of contents for this series]

NYC standardized test results: Changes in average scores for school grades and cohorts

I’ve normalized and re-normalized these average scores so that I can compare them across years and across grades. Now, there are good reasons not to do this. The tests aren’t vertically aligned: the fourth-grade math test could cover different material than the third-grade test, and so on. I don’t have student-level information, so I don’t know which students are really in these averages. Tests are evil. And so on, etc., etc. I’m going to do it anyway, and hopefully I’ll be sufficiently critical of any results.

I suspect that, as was mostly the case for the number of students tested, test performance will vary more at the same school and same grade from one year to the next (4th grade 2008 to 4th grade 2009) than at the same school and same cohort from one year to the next (4th grade 2008 to 5th grade 2009). This would indicate that the test measurement error (at the school grade level, for this data) is smaller than the variability of classes at schools. If this is the case then there’s some hope for using these averages to try to get a sense of how well schools are educating their students. If not, then we’ll have less confidence about many things.

Well, here’s the result:

Figure 11-1a. Changes in average Math test scores for random records, same grade year to year, and same cohort grade to grade

That’s a little disappointing. Cohort doesn’t look appreciably stabler than grade, if it’s stabler at all. Going into sixth grade, the distribution isn’t even centered at zero! Of course we only see schools that have both fifth and sixth grades there, which is a smaller set of schools. But what’s causing that? Do those schools have an influx of smart middle-schoolers who weren’t in their fifth grades? That seems unlikely. Is it a result of how students move around system-wide between fifth and sixth grades? Is it an artifact of my chosen normalization method? Curious. Here’s the same graph for ELA.

Figure 11-1b. Changes in average ELA test scores for random records, same grade year to year, and same cohort grade to grade

Okay, actually, before moving on, here’s one more look at these changes. I really wanted them to be stabler for cohorts. The density plots below help with the overplotting, but the conclusion is about the same. Cohorts are particularly strange going from fifth to sixth grade. For ELA, cohort scores might be a little stabler going into seventh and eighth grades, but that doesn’t leave me particularly thrilled.

Figure 11-2. Density of changes in average Math and ELA test scores for same grade year to year and same cohort grade to grade

I can’t think of a way for the above to be a good result for anybody, aside from finding something interestingly weird about fifth-to-sixth-grade cohorts.

[table of contents for this series]

NYC standardized test results: Schools fight the Law of Large Numbers

After all the hemming and hawing over choices that make no visible difference in the plots of this post, I’ve decided I like the bottom right option from the last post the most – subtracting out the estimated student-level median and then dividing by median absolute deviation. So be it!

The question of this post, however, is how the variability of school grade average scores changes with the number of tested students. Smaller samples generally produce more extreme results. There’s less chance for regression to the mean, as it were. The Law of Large Numbers hasn’t had a chance to take effect. If we’re drawing randomly, then the variance of the sample mean gets small when the sample size is big. Does that happen for school grade averages?
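
As a quick aside before the real data, here’s a toy simulation (made-up numbers, not the NYC results) of what that cone looks like when school grade averages really are means of independent random draws:

```r
# Toy simulation: if each school grade average were the mean of n independent
# draws from a single student-level score distribution, its variance would be
# sigma^2 / n, so the scatter of averages should narrow as n grows.
set.seed(1)
n_students <- sample(20:600, 2000, replace = TRUE)             # made-up grade sizes
avg_score  <- sapply(n_students, function(n) mean(rnorm(n)))   # mean of n draws, sd = 1
plot(n_students, avg_score, pch = ".",
     xlab = "number of tested students", ylab = "simulated average score")
abline(h = 0, col = "red")   # the averages should funnel in toward this line
```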

Figure 10a. Normalized average Math scores vs. number of tested students for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

It does, to some extent. The graphs are vaguely cone-shaped. (ELA is a little more cone-shaped; see below.) Of course, students are not randomly assigned to schools, which is what makes these graphs particularly interesting. The elementary grades (3 to 5) generally have fewer tested students, yet the elementary grade averages seem to approach the student average even while the middle school grades (6 to 8) vary widely. This seems consistent with the idea of smaller local elementary schools that educate mostly whoever is nearby, with more of a filtering effect starting in middle school – a parent might be more willing to support a longer commute to a better school, which also means the better school draws good students from a wider area.

It is good news that the school scores do mostly seem to converge toward the average when there are more students. If there were perfect segregation of better- and worse-performing students, we could see all these averages avoiding the center red lines entirely.

If you squint a little, it looks like there somehow aren’t any middle schools with around 200 students in a grade that perform much above average in ELA. You can almost see something like that for math too. Weird. (Code.)

Figure 10b. Normalized average ELA scores vs. number of tested students for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

[table of contents for this series]

NYC standardized test results: Normalizing the distributions of average scores

Ever since starting to look at this data, we’ve dreamed a great dream: a dream of comparing data across years and grades. Deciding to remove District 75 school results eliminated some nasty tails from the distributions, but there’s more work to be done. As a clarification up front: It’s called normalization, but this work will not make the distributions normal, in the Gaussian sense. Don’t get it twisted!

First, we’ll start looking at density plots rather than histograms. This normalizes over the number of observations, giving every such plot equal area under the curve, and makes it easier to get a quick sense of the shapes of the distributions. Here’s how we’re doing:
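
For orientation, the switch from histograms to densities is roughly the sketch below, assuming a data frame scores with hypothetical columns mean_scale_score, grade, and year (not necessarily the names in the actual data or code for this series):

```r
library(ggplot2)

# Density curves integrate to one, so panels with different numbers of
# school-grade records become directly comparable in shape.
# 'scores' and its column names are assumptions here, not the real data.
ggplot(scores, aes(x = mean_scale_score)) +
  geom_density() +
  facet_grid(year ~ grade)
```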

Figure 9-1a. Average reported Math scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

Figure 9-1b. Average reported ELA scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

Clearly, problem one is that the distributions don’t agree about where they sit, in the left-to-right sense. I’ll center them by subtracting out each distribution’s median.
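
In code, the centering step could look something like this sketch (same hypothetical scores data frame as above, plus a subject column; not necessarily the actual code behind these figures):

```r
library(dplyr)

# Subtract each (subject, grade, year) distribution's own median so that
# all the density curves line up at zero.
scores <- scores %>%
  group_by(subject, grade, year) %>%
  mutate(centered_score = mean_scale_score - median(mean_scale_score)) %>%
  ungroup()
```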

Figure 9-2a. Centered average Math scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

Figure 9-2b. Centered average ELA scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

That’s much better – though especially for ELA, some distributions are much more peaked than others. With the less peaked ones, we can have fun deciding which looks the most like a boa constrictor digesting an elephant. But to address the non-uniform variability of the distributions, I’ll divide by the median absolute deviations, giving these results:
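
Continuing the same sketch, the scaling step just divides by each distribution’s MAD (R’s mad() includes a consistency constant of about 1.4826 by default, which rescales everything uniformly and doesn’t change the comparison):

```r
library(dplyr)

# Divide each centered distribution by its median absolute deviation so the
# spreads match too; mad() centers on the median by default.
scores <- scores %>%
  group_by(subject, grade, year) %>%
  mutate(normalized_score = centered_score / mad(mean_scale_score)) %>%
  ungroup()
```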

Figure 9-3a. Normalized average Math scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

Figure 9-3b. Normalized average ELA scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013

Now those distributions look… at least a little similar. And those are the normalized scores I’ll use going forward. Why did I choose the median and median absolute deviation, rather than the more common mean and standard deviation, which give what are commonly called “z-scores”? Because that doesn’t work as well here. In particular, it gives substantially worse results for some of those spiky ELA distributions. I tried several possibilities:
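
The grid of alternatives can be built in one pass. Here’s a rough sketch of the four school-level combinations (mean vs. median for centering, standard deviation vs. MAD for scaling), again with the hypothetical scores columns from above rather than the actual code:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Compute all four centering/scaling combinations, then overplot the densities.
variants <- scores %>%
  group_by(subject, grade, year) %>%
  mutate(mean_sd    = (mean_scale_score - mean(mean_scale_score))   / sd(mean_scale_score),
         mean_mad   = (mean_scale_score - mean(mean_scale_score))   / mad(mean_scale_score),
         median_sd  = (mean_scale_score - median(mean_scale_score)) / sd(mean_scale_score),
         median_mad = (mean_scale_score - median(mean_scale_score)) / mad(mean_scale_score)) %>%
  ungroup() %>%
  pivot_longer(mean_sd:median_mad, names_to = "method", values_to = "value")

# One curve per grade-year distribution, one panel per normalization method.
ggplot(variants, aes(x = value, group = interaction(grade, year))) +
  geom_density() +
  facet_grid(subject ~ method)
```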

Figure 9-4a. Average Math scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013, normalized several ways, overplotted

Figure 9-4b. Average ELA scores for non-D75 NYC public schools (charter and non-charter) grades 3-8, 2006-2013, normalized several ways, overplotted

Z-scoring (top left sub-plots) works pretty well for the math score distributions, but using the standard deviation (top row of sub-plots) is a disaster for the ELA distributions. I had initially favored using a student-level statistic for centering, since it makes the school averages somehow more sensible (above or below the student average, vs. above or below an average of school averages), but the un-weighted median gives better results, I think. Take a look at the full-resolution images and I think you’ll agree with my choice. Probably. In fact, now I’m having second thoughts.

I’ll note, while I think, that every time I’ve ever looked at anything for both Math and ELA results, ELA has always been weirder. Math results behave more like measurements. I wish I had a formal way to say that, or some theoretical explanation aside from some hand-waving mixture of “math skills are more testable” and “who knows what’s going on with ELLs”.

Normalization forces a choice: what information do we get rid of, and what do we keep? I don’t think too highly of these standardized test results, really. I think of them mostly as inducing an ordering of schools. But if I use a percentile or rank, then changes in the middle of the distribution are exaggerated, which I don’t think makes sense. A two-point difference in the middle of the distribution could look huge, viewed that way.
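
A tiny made-up illustration of that worry: near the crowded middle of a distribution, a two-point move passes many more schools than the same two points would out in a tail.

```r
# Made-up school averages; not the NYC data.
set.seed(2)
school_means <- rnorm(1000, mean = 660, sd = 10)
pctile <- function(x) mean(school_means <= x)   # percentile within this set

pctile(662) - pctile(660)   # near the middle: a big jump in percentile
pctile(682) - pctile(680)   # out in the tail: the same two points barely matter
```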

Can we get rid of differences in the variances? What if some years the test really was less sensitive, so all the scores really were bunched up together? Well, I don’t want that to make it look like there are changes from year to year. You shouldn’t appear to improve just because the distribution shrinks. So I feel 100% okay with trying to eliminate differences in the spread of these distributions.

What about the centering? The issue with using a school-level average is that it doesn’t seem to mean much. Is it demonstrably bad? Let’s see… Yes. It’s bad. If you have a school of all exactly average students, your school score should be zero. Using the mean of school averages, your school of average students would see its score change depending on how students happen to be arranged among other schools. Your school’s score shouldn’t depend on whether there are two other schools or three, so using the mean of school averages is bad. Put another way, though using the student (weighted) mean gives less perfect alignment of the distributions, I think that non-alignment is meaningful information.
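
Here’s the toy arithmetic behind that claim, with entirely made-up numbers: the same students produce the same student-weighted mean no matter how they’re split into schools, but the mean of school averages moves around.

```r
# One school of exactly-average students (score 100), plus 120 other students
# (half at 90, half at 110) who will be split into other schools.
avg_school     <- rep(100, 50)
other_students <- c(rep(90, 60), rep(110, 60))

# Student-weighted mean: 100 regardless of school arrangement,
# so the all-average school centers at exactly zero.
mean(c(avg_school, other_students))     # 100

# Mean of school averages, other students split into two schools (90, 110):
mean(c(100, 90, 110))                   # 100
# ...and split into three schools (90, 90, 110): the "center" moves to 97.5,
# through no fault of the all-average school.
mean(c(100, 90, 90, 110))               # 97.5
```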

Okay! I’m not going to update everything up above. The graphs would look almost identical anyway. But I will use the normalization that centers on student average and divides by median absolute deviation to address variance. Decision made! Either way, the units of these scores are going to be screwy. But next time the code will be shorter!

[table of contents for this series]

NYC standardized test results: Number of students tested at the school grade subject level

After all the preceding, it might be interesting to look at some totals by subject, but I think the time has come to look at the school grade level numbers directly. This analysis of the number of students tested should bear some similarity to the eventual analysis of the average scores themselves, if all goes well.

Figure 8a. Changes in number of NYC public school Math tests for random records, same grade year to year, and same cohort grade to grade

This is another Tukey mean-difference plot, showing changes in the number of tested students in Math. The top panel randomly pairs records to show a sort of baseline maximal variation. Grades three to five have less top end because elementary grades tend to be smaller than middle school grades. The middle panel shows changes in numbers for the same grade at the same school, between years. So, for example, if a school had 40 grade 4 math tests in 2008 and 50 grade 4 math tests in 2009, that’s a change of +10 for grade 4. The bottom panel shows changes in numbers for the same cohort at the same school, between grades (which is also between years, of course). For example, if a school had 40 grade 4 math tests in 2008 and 42 grade 5 math tests in 2009, that’s a change of +2, which shows up in the grade 5 sub-panel. Since testing starts in grade 3, there isn’t any cohort change to observe going into that grade.
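
For what it’s worth, the two kinds of pairings can be computed with something like the sketch below, assuming a data frame tests with hypothetical columns school, grade, year, and n_tested, and no gaps in the records (the actual code surely differs):

```r
library(dplyr)

# Same grade, year to year: grade 4 in 2009 minus grade 4 in 2008, and so on.
grade_changes <- tests %>%
  arrange(school, grade, year) %>%
  group_by(school, grade) %>%
  mutate(change = n_tested - lag(n_tested)) %>%
  ungroup()

# Same cohort, grade to grade: grade 5 in 2009 minus grade 4 in 2008, and so on.
# A cohort is identified by the year its students took (or would have taken)
# the grade 3 test.
cohort_changes <- tests %>%
  mutate(cohort = year - (grade - 3)) %>%
  arrange(school, cohort, grade) %>%
  group_by(school, cohort) %>%
  mutate(change = n_tested - lag(n_tested)) %>%
  ungroup()
```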

As expected, there is typically more variation for grades than for cohorts. This is true even for the cohort change into grade 6 versus grade 6 year to year, though it’s closer there (MAD 10 vs. 9). We know that school populations tend to change more going from grade 5 to grade 6 – students could come from an elementary school and enter a K-8 school, for example, and that would show up here. There are conspicuous outliers for cohort changes, especially going from fifth to sixth grade and from sixth to seventh. This is a little weird. They are almost all positive, meaning that a school seems to have started testing a lot of students in seventh grade, for example, who weren’t tested in sixth grade. It could also be that a school had a big influx of sixth graders from other schools. These outliers suggest that some care should be taken in later analysis of test scores – perhaps a check that the numbers tested are more or less similar.

Because of the outliers, changes for grades and changes for cohorts actually have similar overall standard deviations (about 20) but the median absolute deviation for grade changes (10) is twice what it is for cohort changes (5).
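
That gap is the usual story of the standard deviation getting dragged around by outliers while the MAD mostly shrugs them off; a tiny made-up demonstration:

```r
# Mostly small changes plus a handful of large positive outliers,
# loosely in the spirit of the cohort changes above (made-up numbers).
set.seed(3)
changes <- c(rnorm(1000, sd = 10), rnorm(10, mean = 150, sd = 20))
sd(changes)    # inflated by the outliers
mad(changes)   # barely moves
```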

Figure 8b. Changes in number of NYC public school ELA tests for random records, same grade year to year, and same cohort grade to grade

This figure, showing the same things but for ELA testing, looks very similar but is in fact different. (See code.) The extreme similarity, even for outliers, may reflect that Math and ELA testing go hand in hand, so what happens for one also happens for the other. Hopefully this can be taken as evidence that the apparent cohort swings are due to student movement rather than changes in whether particular students were tested. It might be worth tracking down explanations for some of these strange patterns anyway, but I won’t pursue that for the moment.

[table of contents for this series]

NYC standardized test results: The total number of tests by grade viewed by cohort

With District 75 results removed, this post continues the investigation of the total number of tests reported in the NYC data.

Figure 7a. Total number of tests reported for NYC public schools (charter and non-charter) by grade for 2006-2013

Looking at the number of tests given for each grade over time, we hope to see some pattern. But this is a somewhat limited way to look at the data. Observe that the number rises quite a bit for grade 4 between 2009 and 2010. And the number rises for grade 5 between 2010 and 2011. And it rises for grade 6 between 2011 and 2012. It’s like looking at the pages of a flip-book of something moving left to right.

What’s moving, of course, is a cohort of students. Usually the students in third grade in one year will be in fourth grade the next year. So we would like to compare the third grade of 2006 to the fourth grade of 2007, and so on. A simple calculation gives us a cohort identifier, which will be represented as the years during which a student would be tested for grades 3 to 8 assuming normal passage from grade to grade. For example, a student in the “2008-2013 cohort” started third grade in fall of 2007, took the third-grade test in spring of 2008, and eventually took the eighth-grade test in spring of 2013. We have complete data for three cohorts and incomplete data for ten cohorts, five of which may yet be completed, the other five of which have already moved on to high school and beyond.
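
The calculation really is simple: subtract off the grade to recover the year of the cohort’s grade 3 test, then label the six-year span. A minimal sketch:

```r
# Label each record with the span of years over which its cohort would be
# tested in grades 3 through 8, assuming normal grade-to-grade progression.
cohort_label <- function(year, grade) {
  start <- year - (grade - 3)          # year of the cohort's grade 3 test
  paste(start, start + 5, sep = "-")
}

cohort_label(2008, 3)   # "2008-2013"
cohort_label(2013, 8)   # "2008-2013" (same cohort, five years later)
```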

Figure 7b. Total number of tests reported for NYC public schools (charter and non-charter) by cohort for 2006-2013

If Figure 7a is like flip-book pages, Figure 7b is like single-image traces of everything that was moving. This idea of cohorts will be even more important later when thinking about changes in scores. Importantly, there is apparently more variability in the number of tests per grade over time than in the number of tests per cohort over time, which makes sense. A direct comparison will make this even clearer:

Figure 7c. A direct comparison of NYC grade 3-8 Math and ELA testing numbers by grade (top) versus by cohort (bottom)

Viewed this way, it’s apparent that the 2009-2014 cohort is markedly larger than the 2008-2013 cohort. To see this in the top panel here, which is equivalent to Figure 7a, we have to make many small grade-by-grade judgments, and it is more difficult to see the underlying phenomenon driving the action. Also, with the cohort view we preserve much of what was interesting about Figure 5b – perhaps even in improved form – since we can still see the occasional drops in testing from grade 3 to grade 4, but now in the appropriate context. Viewing by cohort encourages the relevant comparisons.

[table of contents for this series]

NYC standardized test results: Considering District 75 schools

Figure 6a. All reported average Math scores for NYC public schools (charter and non-charter) grades 3-8, 2006-2013 (District 75 school results are in red; all other schools' results are in blue.)

When I first put all this data together, I noted some strange low-scoring outliers. We also observed a dip in total tests and students tested between 2012 and 2013, and I hinted that this might be a result of District 75. What is the New York City Department of Education’s District 75?

District 75 provides citywide educational, vocational, and behavior support programs for students who are on the autism spectrum, have significant cognitive delays, are severely emotionally challenged, sensory impaired and/or multiply disabled.

There are two things to notice about D75 school results in this data, and both of them give us reason to exclude D75 results from all further analysis. First, District 75 scores are (almost always) very low and don’t seem to follow the same distribution as other schools’ scores. Second, for whatever reason, there is no data for District 75 schools for 2013. It isn’t clear whether D75 students suddenly stopped taking these standardized tests or whether the city just stopped including their results, but either way it seems that D75 shouldn’t be included in further analysis.
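
As an aside, if the data identifies schools by DBN, dropping District 75 is a one-line filter, since the district is the first two characters of the DBN (District 75 DBNs look like “75X012”). A sketch, assuming a scores data frame with a DBN column (hypothetical names):

```r
library(dplyr)

# District 75 DBNs begin with "75", so exclude them by district prefix.
scores_no_d75 <- scores %>%
  filter(substr(DBN, 1, 2) != "75")
```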

Figure 6b. All reported average ELA scores for NYC public schools (charter and non-charter) grades 3-8, 2006-2013 (District 75 school results are in red; all other schools' results are in blue.)

(The R code that generated these graphs is available.)

[table of contents for this series]