Notes from “Educational Data Mining: Predict the Future, Change the Future”

Data Mining doesn’t sound as sexy as Data Science these days, but CUNY’s initiative has pulled together a fantastic series of talks focusing on whatever you call it, as applied to education. Ryan Baker opened the series earlier today with an excellent overview of the whole field. CUNY will be posting video of the talk, as well as references to papers mentioned, but there was so much good stuff that I wanted to explore and leave some links to interesting things here.

Professor Baker started with a brief history of big data, which has been used for years in physics, biology, and meteorology. Now it’s popular across the web, largely in very clearly commercial applications, but it’s becoming very relevant to education, largely because educational software can collect so much data as students interact with it.

The applications of EDM described were very diverse, from predicting standardized test scores to automatically detecting student (dis)engagement, “gaming” of educational systems, or emotion broadly. An amusing example is detecting “WTF behaviors,” where “WTF” stands for “Without Thinking Fastidiously” – for example, if students are exploring a largely unsupervised 3D world, how do you know who’s really on task and who’s just carrying bananas to the toilet for kicks? (True story.)

It’s an exciting time! As Professor Baker says, “The data’s all there, you just have to find ways to link it.” And, of course, ways to learn from it. What follows will be a collection of things that you can link and learn from right now:

There are two major societies in this field:

Other interesting things:

School District Demographics System: Probably just the tip of the things-I-didn’t-know-were-on-the-NCES-web-site iceberg. Neat interactive mapping and downloadable data sets by school district.

Zombie Division: A game, apparently popular in the UK, of the ubiquitous “first-person destroy-numbered-skeletons-with-weapons-corresponding-to-their-divisors” genre.

I keep hearing more about Reasoning Mind, an online math ed thing. I haven’t fallen in love with it yet, but who knows.

PSLC DataShop: “The world’s largest repository of learning interaction data.” Sort of like a cross between the UCI Machine Learning Repository and ICPSR, intersect education. It’s at LearnLab, the Pittsburgh Science of Learning Center.

Professor Baker revealed that he’s planning a Coursera (MOOC) course in big data and education for the fall 2013 semester, which should be announced in August. I’m looking forward to it!

“Signals” at Purdue: A really neat project that identifies college students at risk of not succeeding, as early as the second week of classes, and automatically recommends interventions as simple as an email lightly customized by a professor. Really shiny web site, too!

It would be nice if projects like Signals published more about their methods. Apparently universities in Europe, particularly Spain, are less competitive and so publish more about the programs they develop to help their students. Of course, it’d be nice if Knewton would publish too. Pearson does, for goodness’ sake.

Largely, it seems like the best ed tech follows the “all dashboards should be feeds” advice, not just showing data but identifying the actionable bits and even making recommendations for actions to take. Assistments, for example, will send an email summary of online homework completion, letting a teacher know which question students had the most difficulty with, and other highlights.

Speaking of Assistments, apparently Science Assistments has spun off and switched to a much less memorable name.

And finally, I found interesting an off-the-cuff top three of the kinds of things that Educational Data Mining studies. Here it is as I understood it:

  1. Knowledge modeling: “What do students know?”
  2. Knowledge structure modeling: “What’s the deal with these things we know?”
  3. Emotion/engagement modeling: “How are students feeling?”

I’m pretty interested in number two, but they’re all good. Looking forward to more talks at CUNY!

NYU Large Scale Machine Learning (Big Data) Lecture Two: More Methods and also LBFGS

[Slides and video from the lectures are online!]

After being packed into room 101 for week one, the class has been relocated to the expansive 109, which is big enough that we have become “sparse” – and much more comfortable!

My notes have become much longer and less coherent. I almost didn’t put this up; you can, after all, see the complete slides and video online. I’ve decided to just review my notes and leave summarized or interesting tidbits here, removing most of the detail.

Professor LeCun presided over the first hour, starting from “(Almost) everything you’ve learned about optimization is (almost) irrelevant.” This is kind of funny, but also true in the sense that in machine learning you aren’t generally just adding significant digits to a known good solution – you’re trying to maximize generalization performance, which overfitting will hurt. So you aren’t really concerned with how fast you can get more digits in your solution (“asymptotic convergence speed is irrelevant”).

Ah – a name I missed in the first lecture was “Léon Bottou”. He may be coming in to give a guest lecture. I guess he’s the guru of stochastic gradient descent. Professor LeCun very much enjoyed putting up an image from Pink Floyd’s The Wall to dramatize the way learning time goes up seemingly exponentially when you want greater precision from SGD, which is the issue from the previous paragraph.

Another interesting device employed by Professor LeCun was his comparison of online SGD to the Linux (open source, bazaar vs. cathedral) development model. SGD gets an imperfect release out quickly, and then improves it with feedback and iterations, as opposed to a batch method that has to look at all the data and really think about it for a while before it produces any model at all. Interesting.

And then we can understand why SGD takes forever to converge to high precision – looking at observations (training examples) one at a time exposes you to the raw noise of those individual observations. Professor LeCun calls this “gradient noise”.

An epoch is a complete pass through your training data. Online algorithms update the model many times per epoch, whereas batch algorithms update the model only once per epoch. Professor LeCun went on to show that batch gradient descent is slow to approach a solution through areas of low loss-function curvature. Online SGD is noisier (jumps around more) but gets closer to the solution faster. (You could omit “online” before “SGD” without changing the meaning, I think.)
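
To make the epoch bookkeeping concrete, here’s a tiny sketch of my own (not from the lecture) comparing batch gradient descent against SGD on a toy one-parameter least-squares problem; all names and numbers are invented for illustration.

```python
import random

random.seed(0)
# Toy data from y = 3x + noise; 3 is the "true" weight (all values illustrative).
data = [(x, 3.0 * x + random.gauss(0, 0.1))
        for x in [random.uniform(-1, 1) for _ in range(200)]]

def batch_gd(epochs, lr=0.1):
    """Batch: one model update per epoch, from the full-data gradient."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def sgd(epochs, lr=0.1):
    """Online: many updates per epoch, one per training example."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # single-example gradient: noisy but cheap
    return w
```

After a single epoch, sgd(1) is already near the true weight, while batch_gd(1) has taken only one small step; batch catches up, and eventually wins on precision, only after many epochs.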

What followed was an interesting discussion of optimizing your learning rate and how Newton’s method is not practical (O(n³)) but equivalent to gradient descent in a particularly warped space. Unfortunately the warping is also expensive to compute.

There was also some interesting connection to PCA, among other things. The final conclusion seemed to be that averaging your recent models after you’ve run SGD for a while is a good idea: at that point you’re pretty sure you’re near the solution, so averaging hopefully cancels the remaining noise and gets you closer than stopping SGD at some arbitrary point and taking whatever slightly noisy model it was on.
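
Here’s a little sketch of that averaging idea (my own toy, not the lecture’s code): run constant-step SGD on a noisy one-parameter problem, then compare the final iterate with the average of the recent iterates. The true weight of 3 and all other numbers are invented for illustration.

```python
import random

random.seed(0)
# Noisy toy data from y = 3x + noise; the true weight is 3.
data = [(x, 3.0 * x + random.gauss(0, 0.5))
        for x in [random.uniform(-1, 1) for _ in range(500)]]

def sgd_iterates(lr=0.05):
    """Run one pass of SGD, recording the model after every update."""
    w, ws = 0.0, []
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
        ws.append(w)
    return ws

ws = sgd_iterates()
last = ws[-1]                # wherever SGD happened to stop
avg = sum(ws[-200:]) / 200   # average of the recent, near-solution iterates
```

The final iterate bounces around the solution with the full gradient noise; the tail average tends to sit closer to it, which is the intuition for averaging instead of stopping arbitrarily.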

Then there were some pretty neat demos showing the behavior of various algorithms with neat visualizations, all coded up in Torch. I’m not sure who did them – some student? I also hadn’t realized that Torch was all done with Lua. And the plotting was done in gnuplot, which I hadn’t really seen used since I read Data Analysis with Open Source Tools. Neat!

After a break, the lecture was handed off to Professor Langford, who was actually going to talk about a batch algorithm, which is a little out of character for him (and the class, perhaps). But it’s a particularly neat algorithm! I was dismayed for a while not to know what BFGS stood for, so it was good to find that it’s just a bunch of names – four names, of people who independently published papers describing the method in the same year (1970). Professor Langford described it as “an algorithm whose time had come.” The L came later (1980) and just stands for Limited-memory. So there’s the acronym dealt with!

Professor Langford did introduce some of the possible upsides of batch algorithms – namely that they tend to be more mature, perhaps better-implemented, and also possibly better for a parallel computing environment, where online methods can be hard.

The actual method is fairly elegant and I thought Professor Langford’s slides and explanation were quite good. BFGS basically approximates difficult things like the inverse Hessian using simple quantities you already have empirically, such as successive iterates and their gradients.
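
In one dimension that idea collapses to the secant method, which makes for a tiny sketch (mine, with a made-up objective): estimate the curvature from two successive gradients instead of computing a second derivative, then take a Newton-style step.

```python
def grad(w):
    # Gradient of the toy objective f(w) = (w - 2)**4, minimized at w = 2.
    return 4 * (w - 2) ** 3

# Quasi-Newton flavor in 1-D: the curvature estimate comes from quantities
# we already have (two iterates and their gradients), not from a Hessian.
w_prev, w = 0.0, 0.5
for _ in range(50):
    curvature = (grad(w) - grad(w_prev)) / (w - w_prev)  # empirical 2nd derivative
    w_prev, w = w, w - grad(w) / curvature               # Newton-style step
```

Real (L)BFGS keeps a low-rank approximation to the inverse Hessian built from the same kind of iterate and gradient differences, updated so it stays cheap to apply.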

You do have to start BFGS with something – Professor Langford recommended always starting with online learning. Just do some online learning to get more or less close to some sort of solution, and then (L)BFGS is good at getting the higher-precision digits.

It is a little interesting that the lecture started by discounting the importance of higher-precision digits, but came back to a method for finding some. To paraphrase Professor Langford, “If you’re dealing with ad display issues for major companies, it turns out that the fourth digit can actually matter.” (That could be a perfect quote; check it against the video if you want.)

I got to feel slightly smart when I knew the answer to one of the questions Professor Langford offered – but the question and answer have much the same character as the old joke: “Doctor, it hurts when I move my arm like this!” “So don’t move your arm like that!” The question was, “What do you do if you update your model and your loss actually goes up?” And the answer is basically, “Don’t update your model like that!” Really the answer was: only update your model half as far as you were going to, see if that fixes it, and repeat until you reduce loss versus the old model – but you get the idea. It’s nice when there’s a question that I can handle.
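
A minimal sketch of that halving trick (the toy loss and names are mine, not from the lecture):

```python
def loss(w):
    return (w - 4) ** 2  # toy loss, minimized at w = 4 (illustrative)

def cautious_update(w, step):
    """Try the proposed step; if loss would go up, halve it until it doesn't."""
    while loss(w + step) >= loss(w) and abs(step) > 1e-12:
        step /= 2
    return w + step
```

For example, cautious_update(0.0, 10.0) first tries w = 10 (loss gets worse), halves the step to reach w = 5 (loss improves), and accepts that; a real implementation folds this into the line search rather than bolting it on.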

Professor Langford did go on to warn against overfitting, saying that “these things are made to optimize, and that means they will overfit if they can.” He recommended some sort of regularized loss which I don’t have decent notes on, and then did some demos with vw. The immediate connection back to the machine is nice in this class. It isn’t just theory; there are working implementations that show you how it works. There was some discussion of how to stop and restart an algorithm (for example, with new data) by saving just some things about it. I’m really not sure why you can’t just save the entire working state of the algorithm; it may have to do with good ideas regarding the Limiting of memory.

Addressing convergence, Professor Langford took a pretty pragmatic view, saying that there are some theorems, but it’s rarely really quadratic, and you never perform exact line search. Also, it’s an interesting thing that LBFGS is robust enough to optimize even when your function doesn’t have a second derivative, and so in some sense it is (empirically) even better than Newton.

Finding the MIC of a circle

Maximal Information Coefficient is a measure of two-variable dependence with some interesting properties. Full information and details on how to calculate MIC using R are available from the MINE authors. Here I calculate the MIC for an example (originally here) that Andrew Gelman posed on his blog.

This is Gelman’s proposed test data, with seed set to 42 for replicability.

set.seed(42)
n <- 1000
theta <- runif(n, 0, 2 * pi)
x <- cos(theta)
y <- sin(theta)
plot(x, y)

plot of circle

Sure enough this looks like a circle, which is clearly a relationship between x and y, but a standard correlation coefficient will not indicate this.

cor(x, y)

## [1] -0.008817
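
That near-zero value is exactly what theory predicts. With $\theta$ uniform on $[0, 2\pi)$, both marginal means are zero and the covariance vanishes (a quick derivation of mine, in standard notation):

```latex
\operatorname{Cov}(\cos\theta,\,\sin\theta)
  = E[\cos\theta \sin\theta] - E[\cos\theta]\,E[\sin\theta]
  = \tfrac{1}{2}\,E[\sin 2\theta] - 0 \cdot 0
  = 0
```

So the sample correlation above is just noise around zero, even though the dependence is perfect.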

The interface to work with MINE is a little clumsy, but this will do it. (Be sure to have the R package rJava installed.)

# MINE() is provided by the MINE wrapper script from the MINE authors
df <- data.frame(x, y)
write.csv(df, "data.csv", row.names = FALSE)
MINE("data.csv", "all.pairs")

## *********************************************************
## MINE version 1.0.1d
## Copyright 2011 by David Reshef and Yakir Reshef.
## This application is licensed under a Creative Commons
## Attribution-NonCommercial-NoDerivs 3.0 Unported License.
## See
## for
## more information.
## *********************************************************
## input file = data.csv
## analysis style = allpairs
## results file name = 'data.csv,allpairs,cv=0.0,B=n^0.6,Results.csv'
## print status frequency = every 100 variable pairs
## status file name = 'data.csv,allpairs,cv=0.0,B=n^0.6,Status.txt'
## alpha = 0.6
## numClumpsFactor = 15.0
## debug level = 0
## required common values fraction = 0.0
## garbage collection forced every 2147483647 variable pairs
## reading in dataset...
## done.
## Analyzing...
## 1 calculating: "y" vs "x"...
## 1 variable pairs analyzed.
## Sorting results in descending order...
## done. printing results
## Analysis finished. See file "data.csv,allpairs,cv=0.0,B=n^0.6,Results.csv" for output

results <- read.csv("data.csv,allpairs,cv=0.0,B=n^0.6,Results.csv")
t(results)

##                        [,1]       
## X.var                  "y"        
## Y.var                  "x"        
## MIC..strength.         "0.6545"   
## MIC.p.2..nonlinearity. "0.6544"   
## MAS..non.monotonicity. "0.04643"  
## MEV..functionality.    "0.3449"   
## MCN..complexity.       "5.977"    
## Linear.regression..p.  "-0.008817"

The result is that the MIC for this case is 0.6545, and the relation is highly non-linear. It isn’t clear that 0.65 is the ‘right’ answer for a circle, but it does provide more indication of a relationship than a correlation coefficient of zero does.

[This post also on rPubs.]

PCA, 3D Visualization, and Clustering in R

It’s fairly common to have a lot of dimensions (columns, variables) in your data. You wish you could plot all the dimensions at the same time and look for patterns. Perhaps you want to group your observations (rows) into categories somehow. Unfortunately, you quickly run out of spatial dimensions in which to build a plot, and you probably don’t want to do clustering (partitioning) by hand.

This example will use the iris data set available in R, which has four numeric variables. This is not very many, and the data is pretty nicely behaved, so the results of Principal Component Analysis and clustering will not be terribly bad. You won’t always get decent results when you try to arbitrarily reduce the dimensionality of your data to three just so you can make pretty graphs. But it sure is fun – and it can be useful, both for exploring data and communicating results.

Let’s set up and look at our data:

# this is a little tweak so that things line up nicely later on
iris$Species <- factor(iris$Species,
                       levels = c("versicolor","virginica","setosa"))

head(iris)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

We might take a look at how correlated the four variables are.

round(cor(iris[,1:4]), 2)

#              Sepal.Length Sepal.Width Petal.Length Petal.Width
# Sepal.Length         1.00       -0.12         0.87        0.82
# Sepal.Width         -0.12        1.00        -0.43       -0.37
# Petal.Length         0.87       -0.43         1.00        0.96
# Petal.Width          0.82       -0.37         0.96        1.00

Sepal length, petal length, and petal width all seem to move together pretty well (Pearson’s r > 0.8) so we could possibly start to think that we can reduce dimensionality without losing too much.

We’ll use princomp to do the PCA here. There are many alternative implementations for this technique. Here I choose to use the correlation matrix rather than the covariance matrix, and to generate scores for all the input data observations. Using the correlation matrix amounts to standardizing the variables first, so you don’t have to worry as much about their differing scales as you would with the covariance matrix.

pc <- princomp(iris[,1:4], cor=TRUE, scores=TRUE)

The results are stored in pc and we can examine them in a variety of ways.


summary(pc)

# Importance of components:
#                           Comp.1    Comp.2     Comp.3      Comp.4
# Standard deviation     1.7083611 0.9560494 0.38308860 0.143926497
# Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
# Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000

Here we see that the first three components bring our cumulative proportion of variance to 0.99 already, which is nothing to sneeze at. You can get a similar sort of idea from a scree plot.


screeplot(pc, type="lines")

scree plot

Heck, in this case you might even think that just two components is enough. We can certainly plot in two dimensions. Here is a biplot, via biplot(pc).



This is pretty interesting. You can see how the original variables behave relative to our principal components, which is sort of what we saw in the correlation matrix above. This view is limited to the directions of the first two principal components, however. In the case of the iris data we can already see pretty clear clustering here.

The loadings calculated by princomp are eigenvectors of the correlation (or covariance, your choice) matrix, and they’re stored in the loadings element of the result (pc$loadings in this example). You may prefer to use singular value decomposition for your PCA, in which case you can check out prcomp instead of princomp.
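
As a quick side check (in NumPy rather than R, just for illustration) that the two routes agree: the eigenvalues of the correlation matrix match the squared singular values of the standardized data. The fake data here is invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # induce some correlation

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardizing is the cor=TRUE effect
R = np.corrcoef(X, rowvar=False)          # correlation matrix

evals = np.linalg.eigvalsh(R)                      # eigen route (princomp-style)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # SVD route (prcomp-style)

same = np.allclose(np.sort(evals), np.sort(s**2 / len(X)))
```

Both decompositions recover the same component variances (and the same component directions, up to sign), so the choice between princomp and prcomp is mostly about numerics and interface.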

Let’s start to plot in three dimensions. We’ll use the excellent rgl package, which you can install with install.packages("rgl") if you haven’t already. We’ll plot the scores along the first three principal components for each iris, and color by species.

plot3d(pc$scores[,1:3], col=iris$Species)

simple 3d plot

That plot will be interactive – click and drag to rotate, right click and drag or use the mouse wheel to zoom.

It doesn’t seem like there’s a pre-made function for this, but we can sort of hack together a 3D equivalent to the biplot by adding to our initial 3D plot. This looks reasonably decent:

text3d(pc$loadings[,1:3], texts=rownames(pc$loadings), col="red")
coords <- NULL
for (i in 1:nrow(pc$loadings)) {
  coords <- rbind(coords, rbind(c(0,0,0), pc$loadings[i,1:3]))
}
lines3d(coords, col="red", lwd=4)


You really need to interact with this plot to see how everything is laid out. It’s very much like the biplot from above, but the eigenvectors are drawn on the same axes as the data.

You may also be interested in doing some unsupervised clustering. There are a bunch of ways to do this. In this case we have a “correct” clustering – the three species in the data set – so we can see how close to correct we are. Here’s the popular k-means method:

cl <- kmeans(iris[,1:4],3)
iris$cluster <- as.factor(cl$cluster)

A random seed should be set first if you want this to be reproducible, since k-means starts from a random initialization. We save the cluster assignments from k-means as a new column in the iris data frame, and then we can take a look at how well this works, visually and by tabulation.

plot3d(pc$scores[,1:3], col=iris$cluster, main="k-means clusters")
plot3d(pc$scores[,1:3], col=iris$Species, main="actual species")

k-means clusters / actual species

with(iris, table(cluster, Species))

#        Species
# cluster versicolor virginica setosa
#       1         48        14      0
#       2          2        36      0
#       3          0         0     50

So k-means got all the setosas perfectly, but made some mistakes with the other two species, picking far too many flowers for its cluster 1.

You may want to do some sort of hierarchical clustering. Here’s one way. (See also the Quick-R page on clustering.)

di <- dist(iris[,1:4], method="euclidean")
tree <- hclust(di, method="ward")
iris$hcluster <- as.factor((cutree(tree, k=3)-2) %% 3 +1)
# that modulo business just makes the coming table look nicer
plot(tree, xlab="")
rect.hclust(tree, k=3, border="red")


In this case, the result is very similar to the result from k-means, but just slightly better, catching all the versicolor:

with(iris, table(hcluster, Species))

#         Species
# hcluster versicolor virginica setosa
#        1         50        14      0
#        2          0        36      0
#        3          0         0     50

Of course, with any clustering you should think about how your variables are scaled before you start, which I’ve just neglected here.

There’s much more you can do, with many more options and alternative techniques. Have fun!

Install Vowpal Wabbit on Mac OS X

I failed to get vw installed during the first lecture of NYU’s new big data course, but I’ve got it installed now. Here’s one way to do it:


  1. Install Xcode command line tools. This will give you Apple’s collection of compilers and so on. You could probably get other ones if you prefer.
  2. Install Homebrew. This seems to be the best Mac package manager at the moment. Once you have Homebrew everything else is easy to install.

Install steps (skip lines if you know you have things already):

brew install libtool
brew install automake
brew install boost
brew install git
git clone
cd vowpal_wabbit

At this point you have everything you need. You could link glibtoolize so you can refer to it as libtoolize, or you can edit the build script to put a ‘g’ in front of libtoolize. Do one or the other. Then you’re ready to go through the normal make process.

make install

And you have vw at /usr/local/bin/vw. Easy!

The API the FEC Uses

[I wrote this for Gelman’s 365 Days in Statistical Lives project, but the Money in Politics Data Fest is tomorrow, and there is some small chance that this might be useful for someone. Excuse the flowery details. In one sentence: The FEC doesn’t officially have a public API, but they do have a publicly accessible API that you can figure out. Scroll down to code font to see an example.]

I work at NYU. My job title is Senior Data Services Specialist. Data Services is a collaboration between Division of Libraries and Information Technology Services, and what we do is support members of the university community, largely graduate students, in everything from “Where can I get some data?” to “How do I get this last graph to look right?”

So we get an email from a PhD student in politics:

I would like to download data on campaign spending from the website of the Federal Election Commission, I would like to download information on about 500 candidates for the House of Representatives, i.e. ca. 500 files. The files are under the category “Report Summaries” and should cover the “two-year period” labeled by “2010”. The files that I am looking for do not seem to be accessible through a single download. At the moment, the only way that I can download the information is therefore to enter the name of the candidate, e.g. “Akin” and download the information individually which takes a long time. Would you be able to help me to speed up this process, perhaps by writing a script which would download the files based on a list of candidates?

My first thought, of course, is that he must be wrong. But I go and look, and sure enough there doesn’t seem to be a combined file for quite what he wants. What the heck, FEC?

Generally, our librarians can help people find existing data sets, and our specialists can help people use software for their analyses. Web scraping is not really part of anybody’s job description, but… maybe we do it now? I like the internet, so I decide to give it a try.

Ideally, I can do better than browser-equivalent scraping, if I can figure out what’s really going on; the site gives you a CSV file after you click around for a while. Viewing source is useless, naturally. Using the Inspector in Chrome I can dig through a miserable hierarchy and eventually see that I need to know what the JavaScript function “exportCandCmteDetailCurrentSummary()” is doing. I look through a couple JavaScript files that I can see are getting pulled in. I look and I look but I can’t find a definition for this function.

I don’t even like JavaScript, but I know there’s an HTTP request in there somewhere that I can emulate without the horrible web site, so I decide to download Wireshark and just sniff it out. I should have started with this, because it’s super easy to find, and before long I can get the files we want, by candidate ID, using cURL.

curl -X POST -F "electionYr=2010" -F "candidateCommitteeId=H0MO02148" -F "format=csv" -F "electionYr0pt=2010"

Stick that in a little shell loop, and you can download all the files you want.
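
For example (the ids.txt filename and the FEC_URL variable are stand-ins of mine; FEC_URL should hold the endpoint URL omitted from the curl line above):

```shell
# fetch_all reads candidate IDs, one per line, from the file named by $1,
# and saves one CSV per candidate. FEC_URL must be set to the report-summary
# endpoint (omitted here, as above).
fetch_all() {
  while read -r id; do
    curl -X POST -F "electionYr=2010" -F "candidateCommitteeId=$id" \
         -F "format=csv" -F "electionYr0pt=2010" \
         -o "$id.csv" "$FEC_URL"
  done < "$1"
}
```

Then `fetch_all ids.txt`, with one candidate ID per line in ids.txt, downloads the whole batch.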

Time to meet with the patron. Of course he uses Windows, and after we wait for Cygwin to install, cURL is mysteriously unable to find a shared object file. Fine. We can use a lab machine, which does work. We get all the files downloading.

Then we have to figure out how to combine all the files together. The role of Data Services is really to help people do things themselves, not just do things for people completely. Teach a man to fish, Program or Be Programmed… The requester has worked a little bit in Python, with some help, to do similar things in the past, so that’s a possibility. It looks like he has Stata running on his laptop, and if he wants to use that I may have to hand him off to one of our Stata experts. But it turns out that he used R quite a lot while getting his master’s, so maybe we can use that.

Luckily for everyone, he’s reasonably competent with R and we’re able to start writing a script together that will loop through all the candidate IDs that he’s interested in and grab the data he needs from all the files. Of course the files aren’t 100% consistent in their format and contents, so there are some issues to work out. I get to teach him what grep is.

At this point, it looks like he’s going to be able to finish cleaning up his data and move on to his real focus – some sort of analysis that may shine light on the nature of politics in America. He may be the first person ever to do any sort of analysis with these particular numbers. At this point I can feel good about going to lunch.