Rambling toward a definition of Big Data

I was at Big Data Camp today and it made me think again about what Big Data is. Often this is phrased as “How big is big data?” and there are many answers. Sometimes the size issue is sidestepped to frame Big Data as related to structured vs. unstructured data rather than or in addition to raw size. There has not quite been a proliferation of contradictory definitions, but no single accepted definition has come to dominate.

I offer that Big Data is Big in the literal sense, but as is so often the case it’s a question of “big compared to what?”. There are appropriate points of comparison, but they are not the same for every problem domain, which leads to apparently differing answers as to the size of Big Data. The appropriate point of comparison is the size of data that some existing tool can handle.

Computers have been doing things like word counts for a long time, so if you’re counting words you might not think your data is big until it no longer fits on a single machine. Putting data on multiple machines is a popular class of Big Data approach. Some people will argue with you about how big a single machine can be, of course, and the approach of getting a really big machine and/or new tools that make better use of cores or disks should also count as a Big Data approach.

Unstructured data can seem big incredibly quickly. This is not a failure of the offered definition here – this is because the existing tool for dealing with unstructured data is often “have a human look at it”. So if you have enough unstructured data that it’s not feasible to have a human read it all (or whatever) I would say that you could fairly call it Big Data even if it’s only a few megabytes. It’s differently big than word counts, certainly.

A common misconception is that doing word counts is a good replacement tool for having humans read essays, for example. If you think this then you begin to conflate the two foregoing examples. Tools for analyzing natural language are so primitive that this type of mistaken thinking is understandable, but it is not correct.

Upon reflection as I write this, I think that the difference in the sense of bigness is significant enough that unstructured data shouldn’t be considered big in the same way that terabytes are. I think it’s probably more appropriate to think of it as two separate axes, one of raw size and one of analysis complexity/difficulty. There’s probably a good two-by-two matrix figure in this.

Thank you, Big Data Camp NYC and Strata, for stimulating this line of thought.

Include explicit assumption tests in analysis code

Especially if you’re doing analysis of any importance and especially if the analysis, once created, will be run again and again on new or otherwise updated data, you should write tests in the analysis code.

In software engineering (see Martin’s Clean Code, among many possible references) there is the now-prevalent idea of testing: unit tests, Test-Driven Development (TDD), and so on. Some statistical languages (by which I mean R, and I’d be interested to hear of others) have unit test frameworks. R has RUnit and testthat (see White’s post and this bioconductor one). Tests help you by making components of functionality more explicit and by giving you more confidence that your software is functional even when you make changes later with consequences that you might not notice otherwise.

The testing in software development is mostly about confirming behaviors automatically that a human could check by hand: “My function for adding returns 5 when I ask it to add 3 and 2,” for instance. In analysis, the correct answer depends on data, and is not likely to be known, so the real risk of incorrect results comes from mistakes not in the machinery for calculating (usually) but in unexpected or unaccounted-for conditions that appear in the data being analyzed. Like software unit testing, there is some art in determining exactly which tests you should write, and the benefit of enhanced quality and confidence in that quality is also similar, but the kinds of tests are much more about making explicit your assumptions about your input data.

The best-case scenario when an assumption in your analysis turns out to be incorrect is that your code breaks nicely. (For instance, you reference a column of data by name which doesn’t exist for a new input file and execution of your analysis stops there with an error message.) It is much worse for code to proceed regardless, not alerting you to a hidden problem, and yielding results that are not correct. (For instance, a new classification is added to some categorical variable which your analysis gleefully ignores, now ignoring 30% of cases.)
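To make the silent-failure case concrete, here is a minimal sketch with made-up data. The data frame `cases`, its columns, and the group labels are all hypothetical; the point is only that nothing errors while 30% of the rows drop out of the analysis.

```r
# Hypothetical data: the analysis was written when only groups "A" and "B" existed.
cases <- data.frame(
  group = c("A", "B", "A", "C", "C", "B", "A", "C", "B", "A"),
  value = c(10, 12, 11, 30, 28, 13, 9, 31, 12, 10),
  stringsAsFactors = FALSE
)

# The original analysis quietly keeps only the groups it knows about.
known <- cases[cases$group %in% c("A", "B"), ]
groupMeans <- tapply(known$value, known$group, mean)

# No error and no warning, yet a share of the rows never entered the analysis.
shareIgnored <- 1 - nrow(known) / nrow(cases)
print(groupMeans)
print(shareIgnored)
```

An explicit check such as `stopifnot(all(cases$group %in% c("A", "B")))` placed before the filtering step would stop the run on this data instead of letting it proceed.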

In some things R has this behavior built in. If you try to take the mean() of a vector with any missing values at all, the result will be NA. This forces you (or should, if you don’t just instinctually put in na.rm=T) to think about why there are missing values and how the missing values affect your analysis.
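A small illustration of that built-in behavior, using a made-up vector `x`:

```r
# A made-up vector with one missing value
x <- c(2, 4, NA, 6)

meanWithNA  <- mean(x)                 # NA: R refuses to guess
meanDropped <- mean(x, na.rm = TRUE)   # 4: the mean of the non-missing values

# Before reaching for na.rm, look at how much is missing and ask why
nMissing <- sum(is.na(x))
print(c(meanWithNA, meanDropped, nMissing))
```

Counting the missing values first is one way to make the decision to drop them a deliberate one.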

As you develop analysis code, you should be checking many things as you go along anyway. Writing your checks into the analysis code makes them more explicit, makes them visible to others who view your code, and importantly makes them repeatable when you change your code and when the input data changes.

You could use a full unit testing framework, but in R you can also get a lot of mileage out of the built-in stopifnot(). You pass as an argument something you think should be true. Often the combination with all() is particularly useful.

Here are a few examples of things you might test and how you could test them in R:

1. Test that a data frame’s column names are exactly what you think they are. This is especially important if you ever use column numbers to refer to data rather than the column names – and in general, you want to know about it if the data you’re analyzing changes format at all; even if it’s just the ordering of columns, it could accompany other changes that merit manual investigation.

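# check that the data has exactly these three column names, in this order
# (identical() also fails cleanly if the number of columns differs, where == would recycle)
stopifnot(identical(names(aDataFrame), c("expectedColumn", "expectedColumn2", "expectedColumn3")))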

2. Test that a column (or columns) you think provide a unique identifier are actually all unique. This is often important when merging, and there are more checks you might use around merges to ensure what you get out matches what you’re expecting.

# check that a vector has all unique values
stopifnot(!any(duplicated(aVector)))
# check that some columns have all unique rows
stopifnot(all(!duplicated(aDataFrame[,c("aColumnName", "anotherColumnName")])))

3. Often you’ll assume that only certain values occur, which can lead to problems if you’re wrong, or if you become wrong at some point in the future. Make your assumption explicit!

# check that all values in a vector are one of three we're expecting
stopifnot(all(aVector %in% c("possibleVal1", "possibleVal2", "possibleVal3")))

4. There are occasions when you’ll want to be sure that exactly one of several columns is set. (This check uses the fact that in R TRUE evaluates to 1 when adding.)

# check that exactly one of three values is true everywhere
stopifnot(all(vectorOne + vectorTwo + vectorThree == 1))
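Item 2 above mentions further checks around merges. Here is one possible sketch of what those might look like, using two small hypothetical tables (`students` and `schools`, with invented column names and values). It assumes a many-to-one merge where every row on the left should find exactly one match.

```r
# Hypothetical tables: one row per student, and one row per school.
students <- data.frame(
  studentId = 1:5,
  schoolId  = c("X1", "X1", "X2", "X3", "X2"),
  stringsAsFactors = FALSE
)
schools <- data.frame(
  schoolId = c("X1", "X2", "X3"),
  borough  = c("Bronx", "Queens", "Brooklyn"),
  stringsAsFactors = FALSE
)

# Before merging: the key must be unique in the lookup table,
# and every student must have a matching school.
stopifnot(!any(duplicated(schools$schoolId)))
stopifnot(all(students$schoolId %in% schools$schoolId))

merged <- merge(students, schools, by = "schoolId")

# After merging: no rows were dropped or duplicated, and nothing came back empty.
stopifnot(nrow(merged) == nrow(students))
stopifnot(!any(is.na(merged$borough)))
```

A duplicated key in `schools` would multiply rows in the result, and an unmatched `schoolId` would drop them, so the row-count check after the merge catches both.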

Writing tests like these makes you articulate your assumptions more completely than you might otherwise, and the exercise of doing it can lead you to discover things you might otherwise have overlooked as well. If you think it couldn’t possibly be worth checking something because it is so certain to be true, a substantial fraction of the time you’ll find the most interesting surprises in your data when you do the check anyway. And when you re-run your analysis a month later and find that the new data doesn’t behave like the old data, you’ll be glad to find out right away and be able to update your analysis, rather than possibly relying on or even sharing results you don’t yet know are faulty.

A final note: People often say that they assume such-and-such is normally distributed, for regression and so on, and you might wonder how to check this. There are tests, such as the Shapiro-Wilk test (shapiro.test() in R) but in practice I have never seen real data that Shapiro-Wilk would let you call normal. You’re far better off disabusing yourself of the notion that anything is really normally distributed and either doing parametric work with this understanding or staying more descriptive and/or nonparametric. See also Cook’s excellent exposition on distributions.
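For reference, a minimal sketch of running the test. The sample here is simulated and deliberately skewed, so it is only an illustration of the mechanics, not a claim about any real data set.

```r
set.seed(42)

# Simulated, clearly non-normal sample (an assumption for illustration only)
skewed <- rexp(500, rate = 1)

# shapiro.test() accepts between 3 and 5000 non-missing values
result <- shapiro.test(skewed)
print(result$p.value)   # a small p-value is evidence against normality

# A descriptive, nonparametric summary makes no normality claim at all
print(quantile(skewed, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)))
```

With large samples the test has enough power to flag even modest departures from normality, which is part of why real data so rarely passes it.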

Data Science is Learning from Data

There are a lot of unhelpful definitions of data science. To be a useful term, it needs a sensible meaning. This is what I mean by data science:

Data science is learning from data.

The check for whether you’ve just done data science is a two-part test:

  • Did you use data?
  • Did you learn something?

In some ways then, data science is generalized science – science without a specific field. On the other hand, sometimes a data scientist may not develop the “experiments” that generate the data, and in this case data science corresponds more closely to a subset or specialization of scientific skills.

An irate student of mine once argued that if data science is at all sensibly named, there should be hypotheses being tested. This often doesn’t seem to be the case. I agree that the scientific method is a beautiful thing, but I also think that a lot of good science has been and continues to be observational. The reason for an expedition to Galapagos was never to test a hypothesis on the existence of the Blue-footed Booby, for example. Often there’s much to be learned just in describing data.

If your machine is learning but you’re not, you aren’t doing data science. You may certainly be doing very good data engineering and solving real problems. Algorithms and their development are the field of computer science. Their application happens in the fields of software engineering and data engineering. Deep learning and so on work very nicely and solve real problems, but if there isn’t a finding that humans can understand, it isn’t data science. Data science can certainly use statistics and machine learning, but black box techniques are not generally helpful for human understanding.

It certainly isn’t about the size of the data. There are techniques you need in order to work with big data, but these are just techniques. The scientific method is not obsolete. It is true that more is different, but the way that it’s different is that it’s more. If there’s any change to science, it’s that there’s a backlog of analysis due to the large amount of data. But there again, the data that’s backing up is usually not the kind that comes from proper experiments. We’ll still need experiments.

This definition of a “data scientist” is not far at all from “data analyst”. It may be that the reasons to use the name “data scientist” instead range only from “sexier buzzword” to “distance field from know-nothing low-level and business analysts”.

Business abuses the term data science in two main ways. The first, treating data science as engineering, is understandable, since data science requires the use of some computer science and engineering techniques. But data science is not primarily about engineering (i.e., building) products. Data science could be involved, but most of that is engineering work. The second, a more useless view from business folks, is well explained by IBM: “[w]hat sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings”. Data scientist should not just be a higher-ranking business title than data analyst, and everyone should be able to communicate – unless what you’re really looking for is something like a data journalist, somehow parallel to the way science journalists communicate about science.

Let us go forth and learn about the world.


Tall Infographic: NYC Public School Grade Configurations

It’s not uncommon to try to categorize schools as elementary, middle, and high schools. This is a considerable simplification. Here’s a visualization of the grades each of the 1,818 NYC public schools is designed to serve:

[Image: school grade configurations visualization]

It takes two schools to make one vertical pixel here. (There are 23 PK-6 schools, if that helps with scale.) There are 51 unique configurations. There’s a lot going on.

Note in particular the diversity of schools serving very young students – the so-called “early childhood” schools – at the top. Also note that NYC has three schools that will eventually serve students in grades 9-14, at the bottom there. The divide between schools that do and don’t offer pre-kindergarten programs is interesting as well. (Public pre-kindergarten is also offered by some non-school entities in NYC, not shown here.)

I think it makes more sense to do academic analysis at the grade level rather than the school level. Going directly to the student level would be better still.

The image is based on data from the 2013-10-13 NYC LCGMS file, looking at the “Grades Final” column, which should represent all the grades each school is designed to serve. (The “Grades” column represents grades actually served this year, and shows considerably more variety, since schools are “phasing in” and “phasing out” in New York.) Made with R.
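Counting configurations from a column like “Grades Final” takes only a few lines of R. The sketch below uses a tiny hypothetical stand-in for the LCGMS file; the school names and the comma-separated grade format are assumptions for illustration, not taken from the real file.

```r
# Hypothetical stand-in for the LCGMS file; the real file has one row per school.
# The grade-string format shown here is an assumption.
lcgms <- data.frame(
  `School Name`  = c("School A", "School B", "School C", "School D", "School E"),
  `Grades Final` = c("PK,0K,01,02,03,04,05", "06,07,08", "09,10,11,12",
                     "06,07,08", "PK,0K,01,02,03,04,05"),
  check.names = FALSE,        # keep the space in the column names
  stringsAsFactors = FALSE
)

# Tally how many schools share each grade configuration, most common first
configCounts <- sort(table(lcgms[["Grades Final"]]), decreasing = TRUE)
nConfigs <- length(configCounts)

print(configCounts)
print(nConfigs)
```

Run against the real file, the same tally would give the number of unique configurations and the count of schools in each.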


Just-in-Time Teaching: Blending Active Learning with Web Technology

After Gelman’s post got me interested, I wrote my own first post on JiTT and ordered this book. I’ve now read the book, and there’s a lot to like.

[Image: book cover]

The authors are at pains to make clear that despite the use of technology, the approach aims to “humanize instruction” and make learning more constructivist, more of a team endeavor, more interactive and personal. Before the first chapter runs this quote from Ruben Cubero:

As you enter a classroom ask yourself this question: If there were no students in the room, could I do what I am planning to do? If your answer to the question is yes, don’t do it.

There’s emphasis on the learning of the instructor – learning what the students know and don’t know, what their needs are, and so on. This is the “Just-in-Time” of the approach: changing how you teach on the basis of what you learn about your students in the hours leading up to a class session. The approach is one of the best answers I’ve seen to the question of differentiation, which often jumps prematurely to “How do you accommodate your students’ needs?” before handling “How do you know what your students’ needs are?” This quote appears in a footnote:

“Thinking in terms of how much the student is learning as opposed to how much material has been presented is a fundamental and necessary shift in perspective” [Sutherland and Bonwell, 1996, p. 32].

Differentiation also comes through the flexibility and extensibility of a wide range of good questions and activities, including open-ended essay questions.

I came to reflect on the incredible importance of great questions. It’s made me reconceptualize what a teacher is: a collector and sharer of truly great questions. (The level of thought required to come up with great questions is considerable – see Lesson Study, as in Stigler & Hiebert.) Novak et al. emphasize also Fermi problems – fun estimation problems that make you think flexibly. “How many ping-pong balls fit in a 747?” and so on. They link to this list.

The approach also accesses students’ prior knowledge, builds interest, and of course also hopes to encourage more students to actually do assigned readings. It aims to create shorter feedback loops and to encourage students who might otherwise not ask questions to do so in class and in office hours. Each short assignment includes a comment box that invites students to add thoughts or questions that they might otherwise think were not welcome. There’s also a focus on fostering problem-solving skills, and some of the questions that the authors suggest teachers might ask are reminiscent of Polya.

The authors recommend a good deal of essay and short answer questions, but they do have some interesting thoughts on formative rather than summative multiple-choice questions as well:

In many instances, multiple-choice questions are the most concise way to induce a fruitful discussion. Unlike the essay or estimate formats, a multiple-choice question can point out the richness of a physical situation and drive students to consider many possibilities. Furthermore, because it is possible to include choices that “sound good” but play to students’ misconceptions, these questions can be used to lay traps that students would resent on a test but will remember in the classroom.

These questions are excellent discussion starters. The richness of the situation can be explored by considering what is wrong with each of the incorrect choices. Further discussions arise from considering the conditions under which each choice could be correct. Choices that are based on common misconceptions are particularly valuable.

And even for more flexible question types, the authors note that:

Although such questions are rich in possibilities, they require only a minimal user interface.

Some of the authors recommend pre-class web assignments with three questions: an essay question to be answered in three to four sentences, an estimation question with a short answer, and a multiple-choice question. Other authors offer an approach with just three to five questions that test concepts from the reading assignment. Both have some connection to a reading assignment, but the latter is more evaluative and might be graded on correctness, while the authors in the main seem to recommend grading on completion/effort rather than correctness for these pre-class activities.

The authors make very clear that they mean their pre-assignments to drive the whole class; they do not stand alone. They emphasize sharing and discussing anonymized, perhaps even edited student responses during the class session, pointing out that if feedback is positive this can help drive engagement and a feeling of involvement for all students. Several times the authors also recommend individual instructor feedback outside of class time, for example contacting individual students who have not submitted work or helping students who have particular questions that don’t fit into the class session. As they point out, the use of technology in the JiTT process is not for the purpose of reducing the amount of time that instructors need to devote to their classes.

This passage is interesting – the authors are writing in 1999, but the concept is still only now being implemented by the likes of Khan Academy and Knewton. Remember, this was written back when XMLHttpRequest was a new ActiveX object in IE 5:

A Web page link leads a student to a CGI that creates a login page if the student has not been previously authenticated. The student fills in identifying information (perhaps authenticating with a password), and the Submit button then takes the student to a CGI, which creates an assignment tailored to that student. That CGI may look up questions selected from a database, modify the selection based on the student’s history, and personalize some elements of the question. The student receives the page and responds to the questions, and on submitting the form, a CGI records the answers and creates a page of appropriate feedback.

I think this is still an exciting idea. I’d like to see more of the JiTT philosophy flourishing across all levels and fields of education. For all the good, I do have some gripes with the book and how things have developed since its publication, which I’ll leave down below. Feel free to not read them!




  • So old: Refers in the foreword to “SME&T disciplines”. (Gosh, how did that great initialism lose out to STEM?) Other strange phrases: Integrated Development Editors? (Were they not Environments back then?) “Internet shopping market”??? And just using “World Wide Web” all the time.
  • So old but I love it: Java is introduced as being like Smalltalk. Perl and HyperCard are the languages of choice for writing CGI programs. Takes me back.
  • So old: They bet pretty heavily on Java Applets being the wave of the future. Oops.
  • DIY = no implementation: The authors basically say you should run your own web server on your personal computer and write your own CGI scripts to implement what they’re suggesting. They do point out that you could theoretically do it all through email alone (!) but the likelihood of anyone doing that is similarly low. The idea of making a tool that everyone could take advantage of doesn’t seem to have occurred to anyone, with the exception of the following, which isn’t really a JiTT system.
  • Gone corporate: The authors mention WWWAssign, which was apparently freely available online back then. More interestingly, they write on page 162: “We are currently writing a new version, WebAssign, which will be available in both free and commercial versions.” Today I see no traces of WWWAssign, but WebAssign has become a fairly substantial company. There doesn’t seem to be a version that is free in any meaningful sense. The facebook page for hating WebAssign also may suggest that WebAssign didn’t end up with the pedagogical perspective that JiTT aspires to.
  • Abandoned by publisher: Currently unavailable for $52.40 from Pearson Higher Ed. (You can’t even reach the Prentice Hall site on the cover.) It’s $44.26 from Amazon. The printing/binding suggest it should cost perhaps $7. A note in the preface seems to suggest the book might have come to exist only at the urgings of Prentice Hall, but there doesn’t seem to have been any continuing support. True, it’s been a while since 1999, but today if you visit the book’s URL as listed in the book you get an ancient Prentice Hall site for a different book called “Developing Professional Applications for Windows 98 and NT Using MFC”. How does that happen? Closed-minded academic publishing focused on profit and obsessed with copyright may have contributed to the failure of good ideas to thrive: None of the example questions can be re-distributed, and an entire chapter of questions cannot be used at all, even in individual classrooms, without the written consent of the publisher. I’ve never seen a place where a good Creative Commons license was so badly needed. The page for textbooks with JiTT that so needed copyright protection now has two (out of three) working textbook links, neither of which seems to have the materials promised.
  • The image below appeared in my browser today (2013-10-12). What happened? This is a Planet of the Apes moment for me:

[Image: relic of the past]

Infographics are dead – Long live information graphics!

The word “infographic” has come to connote bad design in combination with comically low or even negative data density.*

A recent blog post contrasts “infographics” with “data visualizations”.** A recent xkcd expresses a similar critical sentiment, directed particularly at “tall infographics” which require scrolling through their colorful information deserts. The WTF Visualizations tumblr collects examples of bad information design, most from what would be called infographics.

The “information graphic” of Bret Victor’s 2006 Magic Ink paper is the elegant product of Tufte’s “information design” and the best way to conceptualize effective “information software”. An information graphic is made “to display a complex set of data in a way that [the viewer] can understand it and reason about it.” “Show the data.” “It is for learning.”

While the pejorative sense of “infographic” is dominant now, let’s remain committed to the ideal of good communication through information graphics.



* Negative data density is achieved when incorrect or confusing communication leads to a net loss in human understanding.

** There are two good points in the qunb post about visual perception and increasing data-ink ratio. This is standard Tufte/Few fare. Most of the post seems to be about defining infographics as made by particular tools, like Photoshop, and specifically as static images. Upon trying out their demo product, it’s clear why. If you don’t attend closely to their complete definition of an infographic, you’ll think that the qunb “data stories” product is… an animated slide-show of infographics.

Just-in-Time Teaching (JiTT) very cool but hamstrung by lousy name and lack of implementation

Gelman posted a 15-step process for using Google Forms* (now part of Google Drive, of course) to set up web-based pre-class questions for students to complete – a key element of Just-in-Time Teaching (JiTT). (Gelman uses “jitt” also to refer to a question or group of questions that constitute an assignment.)

The “just-in-time” of JiTT refers to engaging students just before a class and accessing that interaction to customize instruction and encourage active thought and discussion. It’s similar to the entrance and (more commonly) “exit tickets” that I’ve heard about and sometimes used. (JiTT, as an acronym or expanded, is a horrible name for this.)

The definitive publication on JiTT seems to be Just-in-Time Teaching: Blending Active Learning with Web Technology by Novak et al., published way back in 1999. That era seems to be when most of the implementation was done as well: example.

It’s a pity that this approach doesn’t have better backing and good implementations for people to use. Why should teachers who want to do this have to go through 15 steps of rigmarole with Google Forms? Why does this approach seem to be languishing, broadly? The web has come so far since 1999, but even the wiki page for JiTT seems to be a backwater. (I corrected two typos, and there are more issues there…) I suspect that JiTT’s origins in post-secondary education, where professors often advance their careers primarily through research in their specialty (Novak is a physicist) rather than by improving teaching methods (Novak is not in a school of education), are part of the unfortunate story.

I would like to see this problem corrected. There should be a good tool for this online somewhere.


* About Google Forms: If the first step in your user experience is selecting a theme, your design has failed.