Easy AWS EC2 Ubuntu Quick Start

These are my notes on quickly setting up an Ubuntu Linux instance on Amazon EC2, in this case specifically to create a short-term collaborative shared environment. This has become ridiculously easy, but things done only occasionally are quickly forgotten, so I’m recording them here.

You’ll need to already have an Amazon account set up for AWS, which probably includes putting in some billing information. Go to your AWS Management Console and click EC2 and then the big blue “Launch Instance” button. The quick start AMIs are probably fine; at the time of this writing I chose the free tier eligible Ubuntu Server 13.10 64-bit option.

There are some options to adjust or not. Micro instances are free tier eligible. At least port 22 needs to be open for SSH. Storage can be tweaked or not. At some point you’ll be asked to choose or create a key pair. On a Mac it’ll likely be easy to save the something.pem file to ~/Downloads/.

# Optionally first move the key to a natural place, e.g.,
mv ~/Downloads/something.pem ~/.ssh/

# Make the key file "safer" to satisfy ssh
chmod 400 ~/.ssh/something.pem

By now your instance should be running and you can find its IP address in the Running Instances section of the AWS Management Console for EC2. You could also set up an Elastic IP address through the EC2 Management Console at this time. It’s nice to add a line to your /etc/hosts file like EC2.IP.ADD.RES myEC2box (address first, then name, tab separated) so you don’t have to type the IP address all the time, but this can be skipped too. Now you can connect to the instance via SSH:

ssh -i ~/.ssh/something.pem ubuntu@myEC2box
# or
ssh -i ~/.ssh/something.pem ubuntu@EC2.IP.ADD.RES
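If you’d rather not touch /etc/hosts, an entry in your SSH client config does a similar job and also remembers the user and key file. This is only a sketch: add_ssh_alias is a hypothetical helper name, and it assumes the default ubuntu user and the placeholder names used above.

```shell
# Hypothetical helper: append a Host entry to an SSH client config file.
# Usage: add_ssh_alias NAME ADDRESS KEYFILE [CONFIGFILE]
add_ssh_alias() {
    alias_name="$1"
    address="$2"
    key_file="$3"
    config_file="${4:-$HOME/.ssh/config}"   # default client config location

    # HostName, User, and IdentityFile are standard ssh_config keywords.
    cat >> "$config_file" <<EOF

Host $alias_name
    HostName $address
    User ubuntu
    IdentityFile $key_file
EOF
}

# Example (placeholders as in the text above):
#   add_ssh_alias myEC2box EC2.IP.ADD.RES ~/.ssh/something.pem
#   ssh myEC2box
```

With an entry like that in place, a plain "ssh myEC2box" should pick up the address, user, and key without the -i flag.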

Now you should be logged in to the remote machine. Subsequent steps would almost all require sudo, which is annoying, so let’s live dangerously and get to business:

sudo su -                      # not generally recommended
apt-get update
apt-get install emacs24 git    # and/or whatever you want installed

And now we undermine the default security options a bit for convenience.

emacs /etc/ssh/sshd_config
# Edit so that it has "PasswordAuthentication yes"
reload ssh
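If you’d rather not open an editor, the same change can probably be made with sed. This is a rough sketch rather than what I actually ran: enable_password_auth is a hypothetical helper name, and it assumes the directive appears at most once in the file, possibly commented out.

```shell
# Hypothetical helper: make sure the given sshd config file says
# "PasswordAuthentication yes". A backup is left next to it as FILE.bak.
enable_password_auth() {
    conf="$1"

    # Rewrite an existing directive, whether active or commented out.
    sed -i.bak -E \
        's/^#?[[:space:]]*PasswordAuthentication[[:space:]].*/PasswordAuthentication yes/' \
        "$conf"

    # If the file had no such directive at all, append one.
    grep -q '^PasswordAuthentication yes$' "$conf" ||
        echo 'PasswordAuthentication yes' >> "$conf"
}

# Example, as root:
#   enable_password_auth /etc/ssh/sshd_config
#   sshd -t    # test mode: checks the config file for errors
```

After that, reload the SSH service as in the step above.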

Now we’ll be able to ssh in without having key files all over the place. We can add some user accounts like this:

adduser --gecos "" aaron    # avoiding GECOS prompts

# At least one user should be a sudoer
adduser aaron sudo
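For more than a couple of accounts, a small loop saves typing. This is a sketch under assumptions: add_friends and the DRY_RUN variable are made-up names, and the real run has to happen as root, as in the session above. Each adduser call will still prompt for a password.

```shell
# Hypothetical helper: create one account per name given.
# With DRY_RUN set to anything non-empty, the commands are printed, not run.
add_friends() {
    for user in "$@"; do
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
        ${DRY_RUN:+echo} adduser --gecos "" "$user"
    done
}

# Example:
#   DRY_RUN=1 add_friends alice bob    # just show what would be run
#   add_friends alice bob              # actually create the accounts
```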

At this point you should log out and log in as one of the new users with sudo privileges.

# SSH in as one of the new users
ssh aaron@myEC2box

# Change password if desired:
passwd

# Delete the default account:
sudo deluser ubuntu

Totally optional, but the Ubuntu MOTD is annoying:

sudo rm /etc/update-motd.d/10-help-text /etc/update-motd.d/51-cloudguest
sudo emacs /etc/update-motd.d/50-landscape-sysinfo
# Add "--exclude-sysinfo-plugins=LandscapeLink" to appropriate line

Within ten minutes the MOTD should be less annoying. The machine is ready for action! Make accounts for your friends and have an EC2 party!

Daily Rituals is sort of inspiring

Be regular and orderly in your life so that you may be violent and original in your work.

Somehow I saw that Sam Harris was recommending this book. It looked interesting, so I bought and read it, though much of the material can be read on the blog it started as. It seems to have enjoyed some success now, getting press coverage here and there.

There are a lot of idiosyncrasies described, but the most common thread seems to be one that is rather at odds with the bite-sized micro-sectioning of the book: many, many productive people are productive by focusing on work for stretches of three or more hours at a time, as close to undisturbed and undistracted as possible. Nobody seems to write novels while reading BuzzFeed lists and watching videos on YouTube.

Another interesting tendency is that a lot of people consume a lot of caffeine, and a fairly large number even use amphetamines (it isn’t just Erdős). I’m not sure I’ll start going in for the harder stuff, but it makes me feel better about drinking Red Bull every now and then.

Data from naldaramjui.com

The summer of 2011 I made this web site called naldaramjui, which means flying squirrel in Korean. It’s mostly an interface to the Korean government’s test of proficiency in Korean, TOPIK. Just lots of multiple-choice questions. I had always sort of intended to make it even more awesome and do something with the data recorded from people using it, but life happened and Google App Engine is kind of a pain anyway, so I never got the data out – UNTIL NOW.

I think it could be super fun to spend some time analyzing this data properly. Doing the most naive analysis, I can find that the hardest question is advanced listening question 18, and the easiest question is beginner grammar question 17. In the case of the beginner question, I think it might be at a sweet spot: it isn’t terribly hard, and only people who are pretty good for beginners make it to question 17. That’s the kind of thing that makes the simple analysis silly. Oh yes, this is going to be a fun data set to play with. Feel free to join in – data is on GitHub!

Aims of Education

Bret Victor put up his links for 2013 and one of them particularly caught my eye: Why education is so difficult and contentious, by Kieran Egan, which says that there are three conflicting goals of education. This reminded me of a paper I wrote back when I was getting my master of arts in teaching. I was able to find the paper, and sure enough I had been responding to related writing also by Egan. The difficulty with identifying the goal of education, I think, is that it’s ultimately a pretty big question. What is the meaning of life? Why are we here? In my paper I tried to move toward articulating a coherent goal of education. I think the paper did not really succeed completely, but it was a step. Here it is, exactly as it was in 2006:

The Aim of Education

Man is a tame or civilized animal ; nevertheless he requires proper instruction and a fortunate nature, and then of all animals he becomes the most divine and most civilized ; but if insufficiently or ill educated, he becomes the savagest of earthly creatures.  Wherefore the legislator ought not to allow the education of children to become a secondary or accidental matter.

(Plato, quoted from Laws in Frank, 1947, p. 291)

The ideal aim of education is creation of power of self-control.

(Dewey, 1938/1997, p. 64)

The safest general characterization of the [Western] philosophical tradition is that it consists of a series of footnotes to Plato.

(Whitehead, 1929/1978, p. 39)

Education as an institution, and particularly our typically requisite kindergarten through twelfth grade, is largely accepted as a necessary and good thing.  The mission of the U.S. Department of Education includes “assuring access to equal educational opportunity for every individual” and promoting “improvements in the quality and usefulness of education” (Department of Education Organization Act, 1979).  It does not specify, however, what use it is that education should have.  This ambiguity of purpose can cloud discussions about education.  To evaluate how well a system achieves its goals, it must first be agreed what the goals are.

Several considerations should inform a discussion of education’s purpose.  First there is the question of how many such purposes there may be.  If there are multiple objectives then in order for education to succeed they must be compatible.  Second, philosophy and pedagogy must remain clearly differentiated.  An aim of education is not the same thing as a pedagogy.

The underlying question of education’s fundamental aim has unfortunately been neglected and misunderstood.  Students ask why they have to go to school, and teachers should have a good answer.  This paper considers the work of educational theorists in order to identify a consistent philosophy of education.

Egan (1997) considers the modern school as developing contemporaneously with the modern hospital and prison.  Concerning prisons, he asserts that they have in the West “two aims – to punish and to rehabilitate,” and that the incompatibility of these aims leads to difficulties with the system’s implementation (p. 10).  However, it could also be said that the single aim of the modern prison is to reduce crime, in which case there is no conflict in the ends but only in the means.  Alternatively, it could be said that the aim of the prison is simply to house certain individuals away from the general population, for reasons and durations determined by other systems.  It is important to articulate goals at an appropriate level of specificity.

Another modification to Egan’s aims for prisons would be to classify rehabilitation as punishment, in which case there is one goal and it is punishment, or alternatively that punishments contribute to the one goal of rehabilitation.  Either argument, if accepted, collapses Egan’s two aims to just one and eliminates the alleged conflict.  If two goals are truly distinct, it is important to establish this clearly.

After prisons, Egan goes on to construct his dilemma for education, saying that the modern school has not two but three conflicting aims.  The first of these is socialization, “the homogenization of children” and the production of “a skilled workforce of good citizens” (p. 11).  The second aim is labeled Platonic and focuses on “learning those forms of knowledge that would give students a privileged, rational view of reality” (p. 13).  Egan attributes the third aim to Rousseau, with “focus on fulfilling the individual potential of each student” in accordance with “the nature of students’ development, learning, and motivation” (pp. 15-16).  After arguing the incompatibility of these three, Egan suggests a fourth of his own, sketching a sort of Vygotskyan recapitulation theory.

There are several problems with Egan’s taxonomy.  Prime among these is a forced mischaracterization of Plato, with whom Egan groups E. D. Hirsch.  Even in Egan’s own words, Plato hoped for students to acquire “the ability to reflect on ideas, to pull them this way and that,” not only to acquire an inert body of knowledge (p. 13).  Plato and Hirsch may be similar in as much as they recommend a curriculum, but they have very different aims.  Hirsch’s ideas clearly fall in Egan’s first category of socialization – his concern is with transmitting a homogenizing “cultural literacy” in order to ensure, for example, a “common reader” (Hirsch, 1987).

Plato’s goals for education are being confused with his pedagogy in Egan’s interpretation.  The aims should be considered independent of the nature or effectiveness of the curriculum.  Regarding his underlying philosophy of education, scholars of The Republic have noted that Plato was intensely aware of the importance of the environment in the development of individuals, and that since the uncultivated environment is often imperfect, it “can be counteracted only by creating a power for good as penetrating, as unconscious, and as universal; and to do this is the true function of a public system of education” (McClintock, 1880/1968, p. 7).  Bosanquet offers that Plato’s school is “the space and atmosphere needed for the human plant to throw out its branches and flowers in their proper shape,” which sounds very much like Egan’s third category (1900, p. 12).

Indeed, Egan’s second, “Platonic” category is a phantom, on the one hand aggrandizing socialization and on the other trivializing the developmental goals shared by Plato and Rousseau alike.  Egan shares these goals as well, and the “new idea” he offers is pedagogical, a different route but not a different destination.  Egan notes that Rousseau saw his work as “a kind of supplement to The Republic,” and upon examination it is clear that there is, beyond pedagogy, no underlying conflict between Plato and the theorists of Rousseau’s tradition (p. 15).

After clarification of Egan’s four aims there are only two.  The first is the expansion of the socialization category to include the transmittance of any inert knowledge, social or otherwise.  The important note here is that transmittance of knowledge as a goal is different from the use of transmittance of knowledge to achieve other goals.  This goal of transmittance is very specific.  It is the goal of Hirsch, the philosophy of Thorndike, the Lockean model of “the child as empty vessel or blank slate” (Lillard, 2005, p. 9).

The second category of aim is that of Rousseau, Plato, and Egan, which has yet to be fully articulated.  The lack of explicit language about this alternative fundamental aim means that in some sense “educators must establish new goals for learning” (Grabinger, p. 667).  This process can begin with a review of the major relevant educational theorists starting, as Egan indicates, with the work of John Dewey.

Dewey, “the man acknowledged to be the pre-eminent educational theorist of the twentieth century,” has conspicuously little to say about fundamental aims in his Experience and Education (1938/1997).  It reflects the characteristic neglect of the issue that there is so little in this work explicitly addressing it.  One might think the reason for this is that it is taken for granted, as a closed question, but it is then a bit surprising when Dewey explicitly claims that “The ideal aim of education is creation of power of self-control” (p. 64).

This unexpected announcement comes in a discussion of Dewey’s notions of social control and the true nature of freedom as freedom of thought, not freedom of action.  He argues that thinking is “a postponement of immediate action” (p. 64).  Much more recent evolutionary theory such as that espoused by Devlin suggests that “off-line thinking,” thought not connected to physical action, is a key factor differentiating humans from other animals (2000).

Other more recent educational theorists have echoed Dewey more or less subtly, as illustrated for example in the title of Bandura’s 1997 Self-Efficacy: The Exercise of Control.  It was also Bandura who said that “self-reflection is the most uniquely human characteristic” (qtd. in Pajares), aligning with Rousseau’s naturalism in the modern theory of Devlin and others.  This focus on self-control as the product of education, connected as it is to evolutionary psychology, can then be viewed as a result of following Rousseau’s advice to “fix your eye on nature [and] follow the path traced by her” (qtd. in Egan, p. 16).  Piaget said that scientific knowledge is “drawn in large part from common sense,” referring to his fundamental hypothesis of genetic epistemology, “that there is a parallelism between the progress made in the logical and rational organization of knowledge and the corresponding formative psychological processes” (1968).  This view of knowledge implies that anything one might learn is in some sense a result of the nature of humanity – an interesting formalization of the view that education makes us “more human.”

The two goals then can be contrasted as external versus internal.  The first goal, previously called socialization, is to transmit external knowledge into the student, enforcing social rules, for example, from the outside.  The second goal, of Rousseau, Dewey, and the rest, is to develop the student’s natural internal mental faculties.  One might question the extent to which these goals are incompatible, or, if one were Hirsch, whether the second goal is realistically possible, meaningful, or adequate.

Dr. Maria Montessori experimentally developed a system of education that focuses on the work of students, with minimal transmittance in the usual sense of moving information from teacher or textbook into the student.  She observed that in the proper environment discipline “sprang up spontaneously” and that from a student body of children judged passive or domineering, good or bad, “there remain[ed] only one kind of child” (1967, p. 202).  She called this phenomenon “normalization,” and characterized its discovery as “the most important single result of [her] whole work” (p. 204).  In as much as the goal of Hirsch is a behaviorist goal, concerned with instilling good (socially acceptable, beneficial, hard-working) behavior in students, Montessori provides empirical observations that such behavior can arise from students themselves, if the school setting is appropriately constructed.  Plato’s interpreters would agree that in the “proper space and atmosphere” emerges “this divine humanity which is in the truest sense the self” (Bosanquet, 1900, p. 12; McClintock, 1880/1968, p. 22).

Our faith in human nature can extend beyond behavior of this kind, and in particular to language.  Human children learn to speak and understand their native language without being explicitly taught – an often neglected marvel.  Montessori herself noticed “the many wonders of the language mechanism,” anticipating in some ways the modern linguistics often identified with Noam Chomsky, that “there is something special about the human mind that equips it to acquire language” (Montessori, 1967, p. 116; O’Grady, Archibald, Aronoff, & Rees-Miller, 2005, p. 390).  In this view, language is not learned only from external sources, but is to a degree inborn in all humans as a “Universal Grammar.”  This theory is supported by a number of arguments, as well as the apparent existence of a critical period for language learning (O’Grady, Archibald, Aronoff, & Rees-Miller, 2005).

One’s native language is typically learned before schooling begins, but the nativist idea extends beyond language.  For example, the principal claim of Devlin’s The Math Gene is that “the feature of our brain that enables us to use language is the same feature that makes it possible for us to do mathematics” (p. 70).  Since everyone has the capacity for language, everyone has the capacity for mathematics – a traditional bastion of education.  The reason people appear to vary so much more in mathematical fluency than in linguistic fluency is then a combination of the differences in environments that people are exposed to and the unnatural forms that are imposed by traditional education on mathematics.

At least in these three areas – behavior, language, and mathematics – theory and research support the view that they are natural human abilities and so are definitely developed in accordance with the second or naturalist goal for education.  Generalizing broadly, it can be said that anything humans do, no matter how far removed from the struggle for survival, is by definition a naturally human behavior.  In this way anything one might want to transmit to a student was produced by someone, and could also be produced by the student.  There is no real conflict between “natural” and “school” capabilities; the second goal can achieve the aims of the first as well.

The converse, that inculcation of knowledge should develop the mental faculties of the student, is not so clear, and in fact it seems not to be true in general.  Research indicates that “knowledge acquired in abstract circumstances without direct relevance to the needs of learners is not readily available for application or transfer to novel situations” (Grabinger).  It seems that learning many facts or examples does not always lead to generalized mental concepts or what Beers and others call metacognitive skills.  As Beers asked a teacher who had just spent a class explaining one short story to her class, “what did you show them that they could use with another story?” (Beers, 2003, p. 51).

Standardized tests provide a good illustration of this point.  Although of course such tests vary in quality, test-takers with advanced cognitive skills provided even meager content exposure can generally be expected to test well.  Those with extensive externally collected content knowledge may have positive results on the relevant tests, but will not necessarily develop cognitive skills in the process of internalizing this data.  Their knowledge is not transferable or applicable, and hence not useful.  So while well-written standardized tests may provide a rough measure of cognitive ability, they encourage or at least allow the test-specific alternative.

Important mental skills are not served well by this first goal.  As Grabinger reports, there is a difference between intentional and incidental learning (p. 672).  Thorndike’s own 1898 dissertation research showed that a teacher’s instruction in one specific task did not help students perform other related tasks.  Thorndike concluded that the problem was with the human mind rather than with the nature of the instruction or evaluation.  The Deweyan goal for education provides a necessary alternative to this bleak outlook.

Jerome Bruner “take[s] as perhaps the most general objective of education that it cultivate excellence” and explains that this goal means “helping each student achieve his [or her] optimum intellectual development” (1960/1977, p. 9).  What is meant by “excellence” or “optimum intellectual development” is not necessarily well-defined.  Firstly, “optimum” may imply a stopping point.  This language can however be alternatively understood to not contradict the common sentiment that one should always “desire to go on learning,” so that it is this process of learning that proceeds optimally rather than merely reaching an optimal level (Dewey, 1938/1997, p. 48).  Secondly, specifying each student’s optimum raises the question of how much students might vary and how this should be taken into consideration.

Since education cannot very well affect the genetic nature of students, one would hope that the natural endowments of students would not vary too wildly, and this appears to be the case.  The three examples above, most notably the cognitive capability for mathematics, are human universals.  Humans are genetically more similar than different, and the degree to which internal characteristics appear depends largely on the outside environment (Ridley, 2003).  Across many fields, “the preponderance of psychological evidence indicates that experts are made, not born” (Ross, 2006).

Having analyzed the developmental aim of education, some consideration of how it may affect the substance of education is in order.  Neither category of goal specifies a particular classroom, content, pedagogy, and so forth, but the second goal does not even specify that there might be a fixed curriculum at all.  However, neither does it preclude fixed content or systems of instruction – in fact all the major theorists demand structure of some kind in education (e.g., Dewey, Montessori).  While it is not specified that schools should include a fixed list of disciplines for study, the second goal certainly allows for them, and provides an overarching principle for organizing and motivating their instruction.  Considering the example of language, we see that children learn to speak and understand through immersion in speech communities in which they are free to participate fully – this can provide a model for what we call “learning communities.”  Schools provide communities that might not be available elsewhere, identified by rigorous, disciplined thought, where teachers should provide examples of the appropriate fluencies.

The educational goal of developing students’ intellectual abilities is well-founded in this analysis.  Students possess natural capabilities that may be brought out by education, and in fact anything that might be taught can be so developed naturally.  It is a more robust goal as compared to the transmittance of knowledge alone, since it can effectively encourage the other while the reverse is questionable.  It is a goal that can benefit all students, and it is the goal advocated, explicitly or not, by nearly all modern philosophers of education.  Arriving at this goal provides a solid foundation from which to build educational programs and evaluate existing systems, but it is clearly a starting and not a finishing point.  Important questions of what intellectual faculties are and how they can be developed remain, only hinted at herein.  As with all things, it is important to identify and understand the questions before one can reasonably hope to find answers.

We are all aware, probably, that the word “school” is derived from a Greek word meaning “leisure.”  This conception of “leisure” is one of the greatest ideas that the Greeks have left us.  It is not that of amusement or holiday-making.  It is opposed both to this and to the pressure of bread-winning industry, and indicates, as it were, the space and atmosphere needed for the human plant to throw out its branches and flowers in their proper shape.  “To have leisure for” any occupation, was to devote yourself to it freely, because your mind demanded it ; to make it, as it were, your hobby. It does not imply useless work, but it implies work done for the love of it.  In the modern world leisure is a hard thing to get ; and yet, wherever a mind is really and truly growing, the spirit of leisure is there.  It is worth thinking of, how far in education the idea of the growth of a mind can be made the central point, so that the things which are considered worth teaching may really have time to sink into and to nourish the whole human being, morally and intellectually alike.

(Bosanquet, 1900, pp. 11-12)

References

Beers, K. (2003). When kids can’t read: What teachers can do. Portsmouth, NH: Heinemann.

Bosanquet, B. (1900). Introduction. In The education of the young in The Republic of Plato (pp. 1-23) [Introduction]. London: Cambridge University Press.

Bruner, J. S. (1977). The process of education. Cambridge, MA: Harvard University Press. (Original work published 1960)

Department of Education Organization Act, 20 U.S.C. § 3402 (1979), http://www4.law.cornell.edu/uscode/html/uscode20/usc_sec_20_00003402----000-.html.

Devlin, K. (2000). The math gene: How mathematical thinking evolved and why numbers are like gossip. Basic Books.

Dewey, J. (1997). Experience and education (Touchstone ed.). Kappa Delta Pi lecture series. New York: Simon & Schuster. (Original work published 1938)

Egan, K. (1997). Three old ideas and a new one. In The educated mind: How cognitive tools shape our understanding (pp. 9-32). Chicago: The University of Chicago Press.

Frank, S. (1947). Education of women according to Plato. In Plato’s theory of education (pp. 287-308). New York: Harcourt, Brace and Company.

Grabinger, R. S. (n.d.). Rich environments for active learning.

Hirsch, E. D., Jr. (1987). The practical outlook. In Cultural literacy: What every American needs to know (pp. 134-145). Boston: Houghton Mifflin.

Lillard, A. S. (2005). Montessori: The science behind the genius. New York: Oxford University Press.

McClintock, R. L. (1968). The theory of education in the Republic of Plato. New York: Teachers College Press. (Original work published 1880)

Montessori, M. (1967). The absorbent mind (C. A. Claremont, Trans.). New York: Holt, Rinehart and Winston.

O’Grady, W., Archibald, J., Aronoff, M., & Rees-Miller, J. (2005). Contemporary linguistics: An introduction (5th ed.). Boston: Bedford/St. Martin’s.

Pajares, F. (n.d.). Self-efficacy beliefs in academic contexts: An outline.

Piaget, J. (1968). Genetic Epistemology (E. Duckworth, Trans.). New York: Columbia University Press.

Ridley, M. (2003). The agile gene: How nature turns on nurture. New York: HarperCollins.

Ross, P. E. (2006, August). The expert mind. Scientific American. Retrieved August 20, 2006, from http://www.sciam.com/article.cfm?articleID=00010347-101C-14C1-8F9E83414B7F4945&ref=sciam&chanID=sa006

Whitehead, A. N. (1978). Part II, Chapter I, Section I. In D. R. Griffin & D. W. Sherburne (Eds.), Process and reality: An essay in cosmology (Corrected ed., pp. 39-42). New York: The Free Press. (Original work published 1929)

NYC Test Data

A series of posts analyzing publicly available New York City Math and English Language Arts (ELA) standardized test results. There are a lot of graphs. Code is on github.

  1. Putting the data together and looking at it
  2. Checking out the number of students tested in Math and ELA
  3. Checking out the number of students tested in Math and ELA again
  4. The total number of students and tests
  5. The total number of students and tests by grade
  6. Considering District 75 schools
  7. The total number of tests by grade viewed by cohort
  8. Number of students tested at the school grade subject level
  9. Normalizing the distributions of average scores
  10. Schools fight the Law of Large Numbers
  11. Changes in average scores for school grades and cohorts
  12. Changes in scores by year – where is the Common Core shake-up?

 

Clean data with R

This is the content of the talk I did for the 2014-01-08 meetup of Data Wranglers DC. I agree with Tufte on PowerPoint, so I wrote out most of what I wanted to say as a couple blog posts.

The slides are mostly goofy pictures in front of which to talk about the above. The first slide is blank:

Some theory and practice for data cleaning

Data cleaning can refer to many things. I’ll mention some important data structure ideas elucidated by Wickham, and then spend more time on the topic of assumptions and data coherence, and how data cleaning happens in practice. By the end, we may achieve some hope of getting reasonably meaningful results from our data.

Hadley Wickham has made available very good explanations (paper, slides, presentation) of what he calls “tidy data”. Tidy data has variables stored in columns, observations in rows, and a single type of experimental unit per dataset. This framework is a good idea in its own right, and it also lets you easily use a lot of very good data manipulation and graphics tooling in R – much of which is also due to Wickham.

The idea of tidy data provides guidance in how to begin restructuring data for analysis. What, then, are the data problems that are not fundamentally due to data structure?

Harry Frankfurt has a charming essay, published as a small book, called On Bullshit.

cover of On Bullshit

The book is quite funny, and it also provides a useful definition of bullshit: communicating without concern for the truth, or with “indifference to how things really are” (p. 34). It claims, and may be correct, that “bullshit is a greater enemy of the truth than lies are” (p. 61).

A lot of data has this bullshit lack of regard for facts, and it is this sort of data problem that I am principally concerned with. This class of problem can arise all throughout the data quality framework of McCallum, which calls for data to be complete, coherent, correct, and accountable, though typically coherence admits most easily of self-contained testing.

The vast majority of data problems are not intentionally introduced. They appear by accident or through miscommunication, and because they were not detected and corrected. It isn’t easy to completely assure the quality of a data set, and in any event there frequently aren’t resources allocated to thorough checking. Data can seem quite authoritative when viewed from a distance – why check it?

It would be nice to have at least two individuals (or teams) develop the same data product in parallel, collaborating on a shared set of business rules, so that they can check their final results against one another, and have a further QA step before release. But the model of a lone individual hacking out a spreadsheet without so much as a second look before release is unfortunately common. When it comes to open data initiatives, for example, it sometimes seems that data is scooped from a trough and thrown into the world.

Data dirtiness could be addressed at creation, and we should keep this in mind when creating data ourselves. Keep it clean! But as it is often not, it is incumbent on the analyst to be diligent in checking and addressing issues with the data under analysis. Be aware of the process you are a part of.

[Diagram: phenomenon of interest → data creation → data → analysis → system of beliefs]

All of analysis is coming up with beliefs. These beliefs are based on implicit models. Probably the most common explanation for an initial result from data is error. This is especially true for interesting results, but it applies everywhere. For example:

$ wc -l data.csv 
       5 data.csv

A common reading of this would be that the file data.csv contains five lines, or rows. However, this is not necessarily the case:

> read.table("data.csv")
                               V1
1 I think...\nthat...\nit's good!
2                           dunno
3                       It's bad!

There are three rows here. What happened? There was a problem with our assumptions, or with the implicit model we used. The command wc -l does not tell us how many lines there are in a file. It tells us how many newline characters (“\n”) there are – how many bytes in the file have the value 00001010. The implicit assumption was that each line of data had exactly one newline character. It was a linear model: “number of lines in the file = 1 * number of newline characters in the file + 0”. Notice that the model is fine, and the result is also fine – if the assumption holds. But often, we aren’t even aware of the assumption, we aren’t aware of the implicit model, and we don’t check what we’re relying on.
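The same comparison can be made inside R. This is a minimal sketch that assumes the file looks like the example above (three records, one with two newlines inside a quoted field); the temporary file and the variable names are made up for illustration.

```r
# Build a small file like the example: three records, one of which
# contains two embedded newlines inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(c("\"I think...", "that...", "it's good!\"", "dunno", "\"It's bad!\""),
           path)

# What wc -l counts: bytes equal to 00001010 (decimal 10).
bytes <- readBin(path, what = "raw", n = file.size(path))
newline_count <- sum(bytes == as.raw(10L))

# What we usually mean by "rows": parsed records.
record_count <- nrow(read.csv(path, header = FALSE, stringsAsFactors = FALSE))

# Make the implicit model explicit instead of relying on it silently.
assumption_holds <- newline_count == record_count
```

Here assumption_holds comes out FALSE, which is the useful outcome: the one-newline-per-record model has been tested rather than assumed.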

The example with wc is about being aware of levels of abstraction and how they interact with your tools so as to not draw incorrect conclusions. The sorts of assumptions more closely related to what we usually think of as clean data issues are things like:

  • this is everything we should include
  • there’s exactly one record per entity
  • the line items add up to the subtotals
  • these are all in the same units
  • these dates are all in order

A good place to start with a new data set is by checking for problems that could not possibly occur. Very often, you will find that they occur anyway. A common example is checking unique IDs. IDs that are supposed to be unique frequently are not.
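A uniqueness check takes one line, and a couple more will show the offending rows. The data frame here is hypothetical.

```r
# Hypothetical data: 'id' is supposed to be unique, but one value repeats.
records <- data.frame(id = c(101, 102, 103, 102, 104),
                      value = c(5, 7, 9, 7, 11))

# anyDuplicated() returns 0 when every value is distinct.
ids_unique <- anyDuplicated(records$id) == 0

# When the check fails, pull every row that shares an id with another row.
dup_ids <- unique(records$id[duplicated(records$id)])
offenders <- records[records$id %in% dup_ids, ]
```

In a real script the first check would usually be wrapped as stopifnot(anyDuplicated(records$id) == 0) so that it halts the analysis rather than just setting a flag.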

When it comes to checking things, I recommend never checking anything once. When you become aware of something that should be true about your data, write a check that embeds itself forever in your code. Very occasionally there are performance concerns, but almost always correctness is more important. The “theory” here is that the strength of the computer – automating things – should be leveraged to reduce human cognitive load while increasing confidence in results. Include explicit assumption tests in analysis code.
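Some of the assumptions listed earlier translate almost directly into permanent checks. This sketch uses two small hypothetical tables; the column names are invented for illustration.

```r
# Hypothetical order data; the column names are illustrative only.
items <- data.frame(order = c("A", "A", "B"),
                    amount = c(10, 15, 40),
                    stringsAsFactors = FALSE)
orders <- data.frame(order = c("A", "B"),
                     subtotal = c(25, 40),
                     placed = as.Date(c("2014-01-03", "2014-01-07")),
                     stringsAsFactors = FALSE)

# "The line items add up to the subtotals."
item_sums <- tapply(items$amount, items$order, sum)
stopifnot(isTRUE(all.equal(as.vector(item_sums[orders$order]),
                           orders$subtotal)))

# "These dates are all in order."
stopifnot(all(diff(orders$placed) >= 0))
```

Left in the script, these lines cost almost nothing to run and will stop the analysis the first time a new extract of the data breaks either assumption.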

Checks can also be included for things that need not be true – for intermediate results, totals, and so on. Analysis code may have many steps, and it may not be obvious how a change at an early step affects things downstream. Including something as simple as, for example, stopifnot(sum(data$total)==9713) at the bottom of a script will alert you if you introduce something that changes this – especially useful to know when you think you’re making changes that don’t.

Another way to make this point about coded checks: avoid using comments in code to describe data.

# There are 49 missing values.     # NO
stopifnot(sum(is.na(my.data))==49) # YES

The comment is instantly out of date if something changes, and likely forgotten. The code will automatically let you know if something changes. This is vastly superior.

Quite a lot of checks can be written very simply – for example, the often-neglected checks around merges (joins). Many more complex checks are included in the assertive package available for R, which has things like assert_all_are_credit_card_numbers and assert_all_are_isbn_codes, for example.
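As a sketch of what simple checks around a merge might look like, assuming a hypothetical lookup where every person should match exactly one department:

```r
# Hypothetical tables: each person should match exactly one department.
people <- data.frame(name = c("Ann", "Bob", "Cy"),
                     dept_id = c(1, 2, 2))
depts <- data.frame(dept_id = c(1, 2),
                    dept = c("Sales", "Ops"))

# Before the merge: the lookup key must be unique, or rows will multiply.
stopifnot(anyDuplicated(depts$dept_id) == 0)
# Before the merge: every key on the left must have a match on the right,
# or rows will silently disappear from an inner join.
stopifnot(all(people$dept_id %in% depts$dept_id))

merged <- merge(people, depts, by = "dept_id")

# After the merge: no rows gained, none lost.
stopifnot(nrow(merged) == nrow(people))
```

Which checks apply depends on the kind of join intended; for a deliberate one-to-many merge the final row count check would need a different expected value.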

Many checks will be domain-specific, but it’s common to have univariate distributions to investigate. The tails (outliers) are often the result of some data problem, and we should be grateful when this is the case, because it’s much easier to notice than if the spurious data is throughout the range of correct values. Of course the tails are also important places to take care because they will have dramatic effects on many types of analysis.

This is a place where data visualization is an essential tool for data cleaning. Histograms can show some information about a distribution. I often prefer what Cleveland calls quantile plots (different from QQ plots) which can be easily made in R (with index rather than quantile labels) as plot(sort(my.data)). These plots do not depend on bin size – every data point is shown. Much more of the fine structure of a distribution can be quickly and easily taken in. This image shows a contrived data set two ways, illustrating the occasional advantage of quantile plots.
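A comparison along these lines can be produced with a few lines of R. The data below is contrived for this sketch and is an assumption, not the data behind the original image.

```r
# Contrived data: a smooth spread of values plus a pile-up at exactly 50.
set.seed(1)
my.data <- c(runif(200, min = 0, max = 100), rep(50, 40))

par(mfrow = c(1, 2))
# The histogram shows a taller bin, but not that the excess comes from
# a single repeated value.
hist(my.data, main = "histogram")
# The quantile plot shows every point; the repeated value appears as a
# flat shelf at 50.
plot(sort(my.data), main = "quantile plot")
```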

[Image: histogram vs quantile plot]

Many data checks might be called “reality checks”. It is worth being aware of the danger of crossing over into “unreality checks” when our biases overrule the data. There is often an essentially Bayesian flavor to checking data. We suspect that a janitor’s salary of $100,000,000 is not correct because of our prior beliefs about the world. Probably we are correct to say that this data is bad. But if we are over-zealous in eliminating or adjusting data that seems wrong, we may end up eliminating or adjusting the most interesting correct data as well. Care must be taken to make choices that maximize the information content of our data.

When possible, fix data. For example, adult heights entered sometimes in feet and sometimes in centimeters can be easily adjusted to either by machine because their ranges do not overlap. This is probably better than dropping all of either type, but choices like this may not always be easy.
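A minimal sketch of the height fix, with made-up values and a made-up cutoff of 10 separating the two unit systems:

```r
# Hypothetical adult heights, some in feet and some in centimeters.
height <- c(5.5, 170, 6.1, 182.5, 4.9, 158)

# Check the assumption first: every value falls in a plausible range for
# feet or for centimeters, so nothing ambiguous sits in between.
stopifnot(all((height > 3 & height < 8) | (height > 100 & height < 250)))

# Values below the cutoff are taken to be feet (1 ft = 30.48 cm).
height_cm <- ifelse(height < 10, height * 30.48, height)
```

The range check matters as much as the conversion: if a value like 60 (inches, perhaps) ever appears, the script stops instead of quietly treating it as centimeters.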

It’s not uncommon to hear it said that 80% of a data project is data wrangling work. But this doesn’t mean that you can do data cleaning Monday through Thursday and analysis on Friday. It is a good idea to think about data cleaning up front, but there are often practical concerns about the scope of the data cleaning work.

[Image: clean all the data!]

> ncol(my.data)
[1] 1274

[Image: clean all the data?]

In addition, it’s often the case that you discover data problems in the course of analysis. (Sometimes, this is the only way that data problems are discovered.) You likely can’t predict at the outset exactly which fields you’ll need or how problems will arise. A simple example is that two columns may each look fine, but if analysis calls for their ratio, absurd data problems can become evident.
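For instance, with two hypothetical columns that each look plausible on their own, the ratio makes the trouble obvious. The threshold of five times the median is an arbitrary choice for this sketch.

```r
# Hypothetical columns; each has a believable range when viewed alone.
my.data <- data.frame(units_sold = c(10, 120, 0, 11),
                      revenue = c(100, 1200, 110, 1100))

# The ratio is where the problems show: a division by zero and a unit
# price ten times the typical one.
price <- my.data$revenue / my.data$units_sold
typical <- median(price[is.finite(price)])
suspect <- !is.finite(price) | price > 5 * typical

my.data[suspect, ]
```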

Another concern is that data issues are often stubbornly non-general. We would like to aim to generalize, but may find that ad hoc solutions are sometimes unavoidable. We are left aiming to generalize, preparing to specialize.

Data cleaning will probably remain both art and science, entangled with analysis, and resistant to fully generalizable principles. As usual:

The difference between theory and practice is larger in practice than it is in theory.