The Information: a History, a Theory, a Flood

This is a really good book.


James Gleick is excellent. The history is beautifully researched and explained; there is so much content, and it is all fitted together very nicely.

The core topic is information theory, with the formalism of entropy, but perhaps it’s better summarized as the story of human awakening to the idea of what information is and what it means to communicate. It is a new kind of awareness. Maybe the universe is nothing but information! I’m reminded of the time I met Frederick Kantor.

I’m not sure if The Information pointed me to it, but I’ll also mention Information Theory, Inference, and Learning Algorithms by David J.C. MacKay. This book can be read in PDF for free. I haven’t gone all the way through it, but it seems to be a good, more advanced reference.

The Information: Highly recommended for all!

Here Comes Everybody

Harlan mentioned this book so I read it.


It came out back in 2008 and was a lot more timely then, I imagine.

There are lots of interesting tidbits in here. It’s largely anecdote-based, and it uses the word “suasion” twice. Here are some quotes:

… large social systems cannot be understood as a simple aggregation of the behaviors of some nonexistent “average” user.

… it’s easier to like people who are odd in the same ways you are odd, but it’s harder to find them.

… trying something is often cheaper than making a formal decision about whether to try it.

… the question “Do the people who like it take care of each other?” turns out to be a better predictor of success than “What’s the business model?”

Shirky also brings up the Bill Joy quote, “No matter who you are, most of the smart people work for someone else.” This made me wonder whether Google agrees, these days.

I like reciprocal altruism a lot: “With reciprocal altruism, favors are exchanged without formal bookkeeping …” (emphasis mine). This is my preferred way of doing things. The problem seems to be the number of people and anonymity online, and so there are systems with formal bookkeeping like eBay’s buyer/seller rating system, or points on StackOverflow. Is this the direction that everything is moving in? If we end up with zero privacy/anonymity online, will that solve the problem of freeloaders and other bad behavior?

Things I hadn’t previously heard of: asmallworld (gross), Dodgeball (people are still doing this stuff). Also Richard Gabriel’s Worse Is Better talk (increasingly it seems LISP people have all the ideas).

Maybe the most interesting bit from the book was this forward-looking claim:

So here’s a hypothesis about the near future, based on little more than a hunch and some tantalizing examples: we’re about to experience a revolution in collective action, and the driver of that revolution will be new legal structures that will support productive collective action.

I don’t know if that has happened, or if it is happening. Shirky pointed out that intellectual property was the main collective product at the time of his writing – things like Linux and Wikipedia, where licenses like the GPL protect the product. The only things I can think of beyond software and writing are products that get kickstarted, for example, and I don’t know if that counts. Restricting to financial structures seems unfortunate. But crowd-funding and anonymous currencies like Bitcoin might be the closest thing to steps in this direction, as far as I can see. Meetup was in the book, and doesn’t have any special legal structures for organizations as far as I know. What else am I missing?

Data done wrong: The only-most-recent data model

It’s not uncommon to encounter a database that only stores the most recent state of things. For example, say the database has one row per Danaus plexippus individual. The database could have a column called stage which would tell you if an individual is currently a caterpillar or a butterfly, for instance.

This kind of design might seem fine for some application, but you have no way of seeing what happened in the past. When did that individual become a butterfly? (Conflate, for the moment, the time of the change in the real world and the time the change is made in the database – and say that the change is instantaneous.) Disturbingly often, you find after running a timeless database for some time that you actually do need to know about how the database changed over time – but you haven’t got that information.

There are at least two approaches to this problem. One is to store transactional data. In the plexippus example this could mean storing one row per life event per individual, with the date-time at which the row was written to the database. The current known state of each individual can still be extracted (or maintained as a separate table). Another approach is to use a database that tracks all changes; the idea is something like version control for databases, and one implementation with a philosophy like this is Datomic.

With a record of transactional data or a database that stores all transactions, you can query back in time: what was the state of the database at such-and-such time in the past? This is much better than our original setup. We don’t forget what happened in the past, and we can reproduce our work later even if the data is added to or changed. Of course this requires that the historical records not be themselves modified – the transaction logs must be immutable.
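Here is a rough sketch of the transactional approach in Python: an append-only log with one row per life event, plus an "as of" query. Everything named here (`Event`, `LOG`, `state_as_of`, `current_state`, the individuals and the dates) is made up for illustration; in practice this would more likely be a database table than a list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)  # frozen: a recorded event is never modified afterwards
class Event:
    individual: str
    stage: str
    recorded_at: datetime  # when the row was written to the database


# Append-only log: one row per life event per individual (hypothetical data).
LOG = [
    Event("monarch-1", "caterpillar", datetime(2014, 3, 1, 9, 0)),
    Event("monarch-2", "caterpillar", datetime(2014, 3, 5, 9, 0)),
    Event("monarch-1", "butterfly", datetime(2014, 4, 3, 9, 0)),
]


def state_as_of(log, individual, when) -> Optional[str]:
    """Return the stage of `individual` as the database knew it at `when`."""
    known = [e for e in log if e.individual == individual and e.recorded_at <= when]
    if not known:
        return None  # nothing had been recorded yet
    return max(known, key=lambda e: e.recorded_at).stage


def current_state(log):
    """The 'only-most-recent' table, derived from the log instead of stored."""
    latest = {}
    for e in sorted(log, key=lambda e: e.recorded_at):
        latest[e.individual] = e.stage  # later events overwrite earlier ones
    return latest
```

Asking `state_as_of(LOG, "monarch-1", datetime(2014, 4, 1))` gives the caterpillar answer, while the same question for April 4th gives butterfly. The most-recent view is still there, but it is computed from the history rather than replacing it.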

This is where simple transactional designs on traditional databases fail. If someone erroneously enters on April 4th that an individual became a butterfly on April 3rd, when really the transformation occurred on April 2nd, and this mistake is only realized on April 5th, there has to be a way of adding another transaction to indicate the update – not altering the record entered on April 4th. This can quickly become confusing – it can be a little mind-bending to think about data about dates which changes over time. The update problem is a real headache. I would like to find a good solution to this.
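One way to at least frame the problem (I don’t claim it is the good solution I’m looking for) is to keep two dates on every row: when the thing happened, and when we wrote it down. This is sometimes called bitemporal modeling. The sketch below replays the April example; the year, the row layout, and the function name `believed_date` are assumptions made only for illustration.

```python
from datetime import date

# Each row: (entered_on, individual, stage, occurred_on).
# Rows are only ever appended, never edited. The year is an arbitrary choice.
LOG = [
    (date(2014, 4, 4), "monarch-1", "butterfly", date(2014, 4, 3)),  # the mistaken entry
    (date(2014, 4, 5), "monarch-1", "butterfly", date(2014, 4, 2)),  # the correction
]


def believed_date(log, individual, stage, as_known_on):
    """When did we believe, as of `as_known_on`, that `individual` reached `stage`?"""
    rows = [
        r for r in log
        if r[1] == individual and r[2] == stage and r[0] <= as_known_on
    ]
    if not rows:
        return None  # we had not recorded anything yet
    # The most recently entered row wins; earlier rows stay in the log untouched.
    return max(rows, key=lambda r: r[0])[3]
```

With this layout, a report run on April 4th can be reproduced exactly (it saw April 3rd), and a report run on April 5th or later sees the corrected April 2nd. The mind-bending part does not go away, but at least both questions have answers.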


Bayes’ Rule for Ducks

You look at a thing.


Is it a duck?

Re-phrase: What is the probability that it’s a duck, if it looks like that?

Bayes’ rule says that the probability of it being a duck, if it looks like that, is the same as the probability of any old thing being a duck, times the probability of a duck looking like that, divided by the probability of a thing looking like that.

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

This makes sense:

  • If ducks are mythical beasts, then Pr(duck) (our “prior” on ducks) is very low, and the thing would have to be very duck-like before we’d believe it’s a duck. On the other hand, if we’re at some sort of duck farm, then Pr(duck) is high and anything that looks even a little like a duck is probably a duck.
  • If it’s very likely that a duck would look like that (Pr(looks|duck) is high) then we’re more likely to think it’s a duck. This is the “likelihood” of a duck looking like that thing. In practice it’s based on how the ducks we’ve seen before have looked.
  • The denominator Pr(looks) normalizes things. After all, we’re in some sense portioning out the probabilities of this thing being whatever it could be. If 1% of things look like this, and 1% of things look like this and are ducks, then 100% of things that look like this are ducks. So Pr(looks) is what we’re working with; it’s the denominator.
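To see how much the prior matters, here is a small Python sketch. The numbers are invented, and the function name `posterior` is mine; the one extra ingredient is that Pr(looks) is computed by adding up the duck and non-duck cases (the law of total probability).

```python
from fractions import Fraction


def posterior(prior, looks_given_duck, looks_given_other):
    """Pr(duck | looks) by Bayes' rule.

    Pr(looks) is built from the two ways a thing can look like that:
    being a duck that looks like that, or being something else that does.
    """
    p_looks = prior * looks_given_duck + (1 - prior) * looks_given_other
    return prior * looks_given_duck / p_looks


# Hypothetical likelihoods: ducks look like this 9 times in 10,
# other things look like this 1 time in 10.
like_duck, like_other = Fraction(9, 10), Fraction(1, 10)

mythical = posterior(Fraction(1, 1000), like_duck, like_other)  # ducks are nearly mythical
duck_farm = posterior(Fraction(9, 10), like_duck, like_other)   # ducks are everywhere
```

With the same looks, `mythical` comes out to 1/112 and `duck_farm` to 81/82. Only the prior changed.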

Here’s an example of a strange world to test this in:


There are ten things. Six of them are ducks. Five of them look like ducks. Four of them both look like ducks and are ducks. One thing looks like a duck but is not a duck. Maybe it’s a fake duck? Two ducks do not look like ducks. Ducks in camouflage. Test the equality of the two sides of Bayes’ rule:

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

\displaystyle \frac{4}{5} = \frac{\frac{6}{10} \cdot \frac{4}{6}}{\frac{5}{10}}
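The same check can be done by brute-force counting. This is just a sketch of the ten-thing world in Python; the helper names (`things`, `pr`, `is_duck`, `looks`) are my own.

```python
from fractions import Fraction

# The ten things as (is_duck, looks_like_duck) pairs.
things = (
    [(True, True)] * 4      # ducks that look like ducks
    + [(True, False)] * 2   # ducks in camouflage
    + [(False, True)] * 1   # the fake duck
    + [(False, False)] * 3  # everything else
)


def is_duck(thing):
    return thing[0]


def looks(thing):
    return thing[1]


def pr(event, given=lambda thing: True):
    """Probability of `event` among the things satisfying `given`, by counting."""
    pool = [t for t in things if given(t)]
    return Fraction(sum(1 for t in pool if event(t)), len(pool))


left = pr(is_duck, given=looks)                             # Pr(duck | looks)
right = pr(is_duck) * pr(looks, given=is_duck) / pr(looks)  # the other side of the rule
```

Both `left` and `right` are 4/5, computed from nothing but the list of things.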

It’s true here, and it’s not hard to show that it must be true, using two ways of expressing the probability of being a duck and looking like a duck. We have both of these:

\displaystyle Pr(duck \cap looks) = Pr(duck|looks) \cdot Pr(looks)

\displaystyle Pr(duck \cap looks) = Pr(looks|duck) \cdot Pr(duck)

Check those with the example as well, if you like. Using the equality, we get:

\displaystyle Pr(duck|looks) \cdot Pr(looks) = Pr(looks|duck) \cdot Pr(duck)

Then dividing by Pr(looks) we have Bayes’ rule, as above.

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

This is not a difficult proof at all, but for many people the result feels very unintuitive. I’ve tried to explain it once before in the context of statistical claims. Of course there’s a Wikipedia page and many other resources. I wanted to try to do it with a unifying simple example that makes the equations easy to parse, and this is what I’ve come up with.

Aims of Education

Bret Victor put up his links for 2013 and one of them particularly caught my eye: Why education is so difficult and contentious, by Kieran Egan, which says that there are three conflicting goals of education. This reminded me of a paper I wrote back when I was getting my master of arts in teaching. I was able to find the paper, and sure enough I had been responding to related writing also by Egan. The difficulty with identifying the goal of education, I think, is that it’s ultimately a pretty big question. What is the meaning of life? Why are we here? In my paper I tried to move toward articulating a coherent goal of education. I think the paper did not really succeed completely, but it was a step. Here it is, exactly as it was in 2006:


The Aim of Education


Man is a tame or civilized animal ; nevertheless he requires proper instruction and a fortunate nature, and then of all animals he becomes the most divine and most civilized ; but if insufficiently or ill educated, he becomes the savagest of earthly creatures.  Wherefore the legislator ought not to allow the education of children to become a secondary or accidental matter.

(Plato, quoted from Laws in Frank, 1947, p. 291)


The ideal aim of education is creation of power of self-control.

(Dewey, 1938/1997, p. 64)


The safest general characterization of the [Western] philosophical tradition is that it consists of a series of footnotes to Plato.

(Whitehead, 1929/1978, p. 39)


Education as an institution, and particularly our typically requisite kindergarten through twelfth grade, is largely accepted as a necessary and good thing.  The mission of the U.S. Department of Education includes “assuring access to equal educational opportunity for every individual” and promoting “improvements in the quality and usefulness of education” (Department of Education Organization Act, 1979).  It does not specify, however, what use it is that education should have.  This ambiguity of purpose can cloud discussions about education.  To evaluate how well a system achieves its goals, it must first be agreed what the goals are.

Several considerations should inform a discussion of education’s purpose.  First there is the question of how many such purposes there may be.  If there are multiple objectives then in order for education to succeed they must be compatible.  Second, philosophy and pedagogy must remain clearly differentiated.  An aim of education is not the same thing as a pedagogy.

The underlying question of education’s fundamental aim has unfortunately been neglected and misunderstood.  Students ask why they have to go to school, and teachers should have a good answer.  This paper considers the work of educational theorists in order to identify a consistent philosophy of education.

Egan (1997) considers the modern school as developing contemporaneously with the modern hospital and prison.  Concerning prisons, he asserts that they have in the West “two aims – to punish and to rehabilitate,” and that the incompatibility of these aims leads to difficulties with the system’s implementation (p. 10).  However, it could also be said that the single aim of the modern prison is to reduce crime, in which case there is no conflict in the ends but only in the means.  Alternatively, it could be said that the aim of the prison is simply to house certain individuals away from the general population, for reasons and durations determined by other systems.  It is important to articulate goals at an appropriate level of specificity.

Another modification to Egan’s aims for prisons would be to classify rehabilitation as punishment, in which case there is one goal and it is punishment, or alternatively that punishments contribute to the one goal of rehabilitation.  Either argument, if accepted, collapses Egan’s two aims to just one and eliminates the alleged conflict.  If two goals are truly distinct, it is important to establish this clearly.

After prisons, Egan goes on to construct his dilemma for education, saying that the modern school has not two but three conflicting aims.  The first of these is socialization, “the homogenization of children” and the production of “a skilled workforce of good citizens” (p. 11).  The second aim is labeled Platonic and focuses on “learning those forms of knowledge that would give students a privileged, rational view of reality” (p. 13).  Egan attributes the third aim to Rousseau, with “focus on fulfilling the individual potential of each student” in accordance with “the nature of students’ development, learning, and motivation” (p. 15-16).  After arguing the incompatibility of these three, Egan suggests a fourth of his own, sketching a sort of Vygotskyan recapitulation theory.

There are several problems with Egan’s taxonomy.  Prime among these is a forced mischaracterization of Plato, with whom Egan groups E. D. Hirsch.  Even in Egan’s own words, Plato hoped for students to acquire “the ability to reflect on ideas, to pull them this way and that,” not only to acquire an inert body of knowledge (p. 13).  Plato and Hirsch may be similar in as much as they recommend a curriculum, but they have very different aims.  Hirsch’s ideas clearly fall in Egan’s first category of socialization – his concern is with transmitting a homogenizing “cultural literacy” in order to ensure, for example, a “common reader” (Hirsch, 1987).

Plato’s goals for education are being confused with his pedagogy in Egan’s interpretation.  The aims should be considered independent of the nature or effectiveness of the curriculum.  Regarding his underlying philosophy of education, scholars of The Republic have noted that Plato was intensely aware of the importance of the environment in the development of individuals, and that since the uncultivated environment is often imperfect, it “can be counteracted only by creating a power for good as penetrating, as unconscious, and as universal; and to do this is the true function of a public system of education” (McClintock, 1880/1968, p. 7).  Bosanquet offers that Plato’s school is “the space and atmosphere needed for the human plant to throw out its branches and flowers in their proper shape,” which sounds very much like Egan’s third category (1900, p. 12).

Indeed, Egan’s second, “Platonic” category is a phantom, on the one hand aggrandizing socialization and on the other trivializing the developmental goals shared by Plato and Rousseau alike.  Egan shares these goals as well, and the “new idea” he offers is pedagogical, a different route but not a different destination.  Egan notes that Rousseau saw his work as “a kind of supplement to The Republic,” and upon examination it is clear that there is, beyond pedagogy, no underlying conflict between Plato and the theorists of Rousseau’s tradition (p. 15).

After clarification of Egan’s four aims there are only two.  The first is the expansion of the socialization category to include the transmittance of any inert knowledge, social or otherwise.  The important note here is that transmittance of knowledge as a goal is different from the use of transmittance of knowledge to achieve other goals.  This goal of transmittance is very specific.  It is the goal of Hirsch, the philosophy of Thorndike, the Lockean model of “the child as empty vessel or blank slate” (Lillard, 2005, p. 9).

The second category of aim is that of Rousseau, Plato, and Egan, which has yet to be fully articulated.  The lack of explicit language about this alternative fundamental aim means that in some sense “educators must establish new goals for learning” (Grabinger, p. 667).  This process can begin with a review of the major relevant educational theorists starting, as Egan indicates, with the work of John Dewey.

Dewey, “the man acknowledged to be the pre-eminent educational theorist of the twentieth century,” has conspicuously little to say about fundamental aims in his Experience and Education (1938/1997).  It reflects the characteristic neglect of the issue that there is so little in this work explicitly addressing it.  One might think the reason for this is that it is taken for granted, as a closed question, but it is then a bit surprising when Dewey explicitly claims that “The ideal aim of education is creation of power of self-control” (p. 64).

This unexpected announcement comes in a discussion of Dewey’s notions of social control and the true nature of freedom as freedom of thought, not freedom of action.  He argues that thinking is “a postponement of immediate action” (p. 64).  Much more recent evolutionary theory such as that espoused by Devlin suggests that “off-line thinking,” thought not connected to physical action, is a key factor differentiating humans from other animals (2000).

Other more recent educational theorists have echoed Dewey more or less subtly, as illustrated for example in the title of Bandura’s 1997 Self-Efficacy: The Exercise of Control.  It was also Bandura who said that “self-reflection is the most uniquely human characteristic,” (qtd. in Pajares) aligning with Rousseau’s naturalism in the modern theory of Devlin and others.  This focus on self-control as the product of education, connected as it is to evolutionary psychology, can then be viewed as a result of following Rousseau’s advice to “fix your eye on nature [and] follow the path traced by her” (qtd. in Egan, p. 16).  Piaget said that scientific knowledge is “drawn in large part from common sense,” referring to his fundamental hypothesis of genetic epistemology, “that there is a parallelism between the progress made in the logical and rational organization of knowledge and the corresponding formative psychological processes.” (1968)  This view of knowledge implies that anything one might learn is in some sense a result of the nature of humanity – an interesting formalization of the view that education makes us “more human.”

The two goals then can be contrasted as external versus internal.  The first goal, previously called socialization, is to transmit external knowledge into the student, enforcing social rules, for example, from the outside.  The second goal, of Rousseau, Dewey, and the rest, is to develop the student’s natural internal mental faculties.  One might question the extent to which these goals are incompatible, or, if one were Hirsch, whether the second goal is realistically possible, meaningful, or adequate.

Dr. Maria Montessori experimentally developed a system of education that focuses on the work of students, with minimal transmittance in the usual sense of moving information from teacher or textbook into the student.  She observed that in the proper environment discipline “sprang up spontaneously” and that from a student body of children judged passive or domineering, good or bad, “there remain[ed] only one kind of child” (1967, p. 202).  She called this phenomenon “normalization,” and characterized its discovery as “the most important single result of [her] whole work” (p. 204).  In as much as the goal of Hirsch is a behaviorist goal, concerned with instilling good (socially acceptable, beneficial, hard-working) behavior in students, Montessori provides empirical observations that such behavior can arise from students themselves, if the school setting is appropriately constructed.  Plato’s interpreters would agree that in the “proper space and atmosphere” emerges “this divine humanity which is in the truest sense the self” (Bosanquet, 1900, p. 12; McClintock, 1880/1968, p. 22).

Our faith in human nature can extend beyond behavior of this kind, and in particular to language.  Human children learn to speak and understand their native language without being explicitly taught – an often neglected marvel.  Montessori herself noticed “the many wonders of the language mechanism,” anticipating in some ways the modern linguistics often identified with Noam Chomsky, that “there is something special about the human mind that equips it to acquire language” (Montessori, 1967, p. 116; O’Grady, Archibald, Aronoff, & Rees-Miller, 2005, p. 390).  In this view, language is not learned only from external sources, but is to a degree inborn in all humans as a “Universal Grammar.”  This theory is supported by a number of arguments, as well as the apparent existence of a critical period for language learning (O’Grady, Archibald, Aronoff, & Rees-Miller, 2005).

One’s native language is typically learned before schooling begins, but the nativist idea extends beyond language.  For example, the principal claim of Devlin’s The Math Gene is that “the feature of our brain that enables us to use language is the same feature that makes it possible for us to do mathematics” (p. 70).  Since everyone has the capacity for language, everyone has the capacity for mathematics – a traditional bastion of education.  The reason people appear to vary so much more in mathematical fluency than in linguistic fluency is then a combination of the differences in environments that people are exposed to and the unnatural forms that are imposed by traditional education on mathematics.

At least in these three areas – behavior, language, and mathematics – theory and research support the view that they are natural human abilities and so are definitely developed in accordance with the second or naturalist goal for education.  Generalizing broadly, it can be said that anything humans do, no matter how far removed from the struggle for survival, is by definition a naturally human behavior.  In this way anything one might want to transmit to a student was produced by someone, and could also be produced by the student.  There is no real conflict between “natural” and “school” capabilities; the second goal can achieve the aims of the first as well.

The converse, that inculcation of knowledge should develop the mental faculties of the student, is not so clear, and in fact it seems not to be true in general.  Research indicates that “knowledge acquired in abstract circumstances without direct relevance to the needs of learners is not readily available for application or transfer to novel situations” (Grabinger).  It seems that learning many facts or examples does not always lead to mental generalized concepts or what Beers and others call metacognitive skills.  As Beers asked a teacher who had just spent a class explaining one short story to her class, “what did you show them that they could use with another story?” (Beers, 51)

Standardized tests provide a good illustration of this point.  Although of course such tests vary in quality, test-takers with advanced cognitive skills provided even meager content exposure can generally be expected to test well.  Those with extensive externally collected content knowledge may have positive results on the relevant tests, but will not necessarily develop cognitive skills in the process of internalizing this data.  Their knowledge is not transferable or applicable, and hence not useful.  So while well-written standardized tests may provide a rough measure of cognitive ability, they encourage or at least allow the test-specific alternative.

Important mental skills are not served well by this first goal.  As Grabinger reports, there is a difference between intentional and incidental learning (p. 672).  Hirsch’s own 1898 dissertation research showed that a teacher’s instruction in one specific task did not help students perform other related tasks.  Hirsch concluded that the problem was with the human mind rather than with the nature of the instruction or evaluation.  The Deweyan goal for education provides a necessary alternative to this bleak outlook.

Jerome Bruner “take[s] as perhaps the most general objective of education that it cultivate excellence” and explains that this goal means “helping each student achieve his [or her] optimum intellectual development” (1960/1977, p. 9).  What is meant by “excellence” or “optimum intellectual development” is not necessarily well-defined.  Firstly, “optimum” may imply a stopping point.  This language can however be alternatively understood to not contradict the common sentiment that one should always “desire to go on learning,” so that it is this process of learning that proceeds optimally rather than merely reaching an optimal level (Dewey, 1938/1997, p. 48).  Secondly, specifying each student’s optimum begs the question of how much students might vary and how this should be taken into consideration.

Since education cannot very well affect the genetic nature of students, one would hope that the natural endowments of students would not vary too wildly, and this appears to be the case.  The three examples above, most notably the cognitive capability for mathematics, are human universals.  Humans are genetically more similar than different, and the degree to which internal characteristics appear depends largely on the outside environment (Ridley, 2003).  Across many fields, “the preponderance of psychological evidence indicates that experts are made, not born” (Ross, 2006).

Having analyzed the developmental aim of education, some consideration of how it may affect the substance of education is in order.  Neither category of goal specifies a particular classroom, content, pedagogy, and so forth, but the second goal does not even specify that there might be a fixed curriculum at all.  However, neither does it preclude fixed content or systems of instruction – in fact all the major theorists demand structure of some kind in education (eg. Dewey, Montessori).  While it is not specified that schools should include a fixed list of disciplines for study, the second goal certainly allows for them, and provides an overarching principle for organizing and motivating their instruction.  Considering the example of language, we see that children learn to speak and understand through immersion in speech communities in which they are free to participate fully – this can provide a model for what we call “learning communities.”  Schools provide communities that might not be available elsewhere, identified by rigorous, disciplined thought, where teachers should provide examples of the appropriate fluencies.

The educational goal of developing students’ intellectual abilities is well-founded in this analysis.  Students possess natural capabilities that may be brought out by education, and in fact anything that might be taught can be so developed naturally.  It is a more robust goal as compared to the transmittance of knowledge alone, since it can effectively encourage the other while the reverse is questionable.  It is a goal that can benefit all students, and it is the goal advocated, explicitly or not, by nearly all modern philosophers of education.  Arriving at this goal provides a solid foundation from which to build educational programs and evaluate existing systems, but it is clearly a starting and not a finishing point.  Important questions of what intellectual faculties are and how they can be developed remain, only hinted at herein.  As with all things, it is important to identify and understand the questions before one can reasonably hope to find answers.


We are all aware, probably, that the word “school” is derived from a Greek word meaning “leisure.”  This conception of “leisure” is one of the greatest ideas that the Greeks have left us.  It is not that of amusement or holiday-making.  It is opposed both to this and to the pressure of bread-winning industry, and indicates, as it were, the space and atmosphere needed for the human plant to throw out its branches and flowers in their proper shape.  “To have leisure for” any occupation, was to devote yourself to it freely, because your mind demanded it ; to make it, as it were, your hobby. It does not imply useless work, but it implies work done for the love of it.  In the modern world leisure is a hard thing to get ; and yet, wherever a mind is really and truly growing, the spirit of leisure is there.  It is worth thinking of, how far in education the idea of the growth of a mind can be made the central point, so that the things which are considered worth teaching may really have time to sink into and to nourish the whole human being, morally and intellectually alike.

(Bosanquet, 1900, p. 11-12)



Beers, K. (2003). When kids can’t read: What teachers can do. Portsmouth, NH: Heinemann.

Bosanquet, B. (1900). Introduction. In The education of the young in The Republic of Plato (pp. 1-23) [Introduction]. London: Cambridge University Press.

Bruner, J. S. (1977). The process of education. Cambridge, MA: Harvard University Press. (Original work published 1960)

Department of Education Organization Act, 20 U.S.C. § 3402 (1979).

Devlin, K. (2000). The math gene: How mathematical thinking evolved and why numbers are like gossip. Basic Books.

Dewey, J. (1997). Experience and education (Touchstone ed.). Kappa Delta Pi lecture series. New York: Simon & Schuster. (Original work published 1938)

Egan, K. (1997). Three old ideas and a new one. In The educated mind: How cognitive tools shape our understanding (pp. 9-32). Chicago: The University of Chicago Press.

Frank, S. (1947). Education of women according to Plato. In Plato’s theory of education (pp. 287-308). New York: Harcourt, Brace and Company.

Grabinger, R. S. (n.d.). Rich environments for active learning.

Hirsch, E. D., Jr. (1987). The practical outlook. In Cultural literacy: What every American needs to know (pp. 134-145). Boston: Houghton Mifflin.

Lillard, A. S. (2005). Montessori: The science behind the genius. New York: Oxford University Press.

McClintock, R. L. (1968). The theory of education in the Republic of Plato. New York: Teachers College Press. (Original work published 1880)

Montessori, M. (1967). The absorbent mind (C. A. Claremont, Trans.). New York: Holt, Rinehart and Winston.

O’Grady, W., Archibald, J., Aronoff, M., & Rees-Miller, J. (2005). Contemporary linguistics: An introduction (5th ed.). Boston: Bedford/‌St. Martin’s.

Pajares, F. (n.d.). Self-efficacy beliefs in academic contexts: An outline.

Piaget, J. (1968). Genetic Epistemology (E. Duckworth, Trans.). New York: Columbia University Press.

Ridley, M. (2003). The agile gene: How nature turns on nurture. New York: HarperCollins.

Ross, P. E. (2006, August). The expert mind. Scientific American. Retrieved August 20, 2006, from‌article.cfm?articleID=00010347-101C-14C1-8F9E83414B7F4945&ref=sciam&chanID=sa006

Whitehead, A. N. (1978). Part II, Chapter I, Section I. In D. R. Griffin & D. W. Sherburne (Eds.), Process and reality: An essay in cosmology (Corrected ed., pp. 39-42). New York: The Free Press. (Original work published 1929)

Some theory and practice for data cleaning

Data cleaning can refer to many things. I’ll mention some important data structure ideas elucidated by Wickham, and then spend more time on the topic of assumptions and data coherence, and how data cleaning happens in practice. By the end, we may achieve some hope of getting reasonably meaningful results from our data.

Hadley Wickham has made available very good explanations (paper, slides, presentation) of what he calls “tidy data”. Tidy data has variables stored in columns, observations in rows, and a single type of experimental unit per dataset. This framework is a good idea in its own right, and it also lets you easily use a lot of very good data manipulation and graphics tooling in R – much of which is also due to Wickham.
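As a small sketch of the idea, here is a table with one column per year restructured so that each row is a single observation. The data frame and its column names are made up for illustration, and plain base R is used here rather than any of Wickham's packages.

```r
# Hypothetical "wide" data: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  y2010 = c(1, 2),
  y2011 = c(3, 4)
)

# Tidy version: one row per (country, year) observation,
# with each variable (country, year, value) in its own column.
long <- data.frame(
  country = rep(wide$country, times = 2),
  year = rep(c(2010, 2011), each = nrow(wide)),
  value = c(wide$y2010, wide$y2011)
)

print(long)
```

The tidy form has four rows, one for each country-year pair, which is the shape most R modeling and plotting functions expect.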

The idea of tidy data provides guidance in how to begin restructuring data for analysis. What, then, are the data problems that are not fundamentally due to data structure?

Harry Frankfurt has a charming essay, published as a small book, called On Bullshit.

cover of On Bullshit

The book is quite funny, and it also provides a useful definition of bullshit: communicating without concern for the truth, or with “indifference to how things really are” (p. 34). It claims, and may be correct, that “bullshit is a greater enemy of the truth than lies are” (p. 61).

A lot of data has this bullshit lack of regard for facts, and it is this sort of data problem that I am principally concerned with. This class of problem can arise all throughout the data quality framework of McCallum, which calls for data to be complete, coherent, correct, and accountable, though typically coherence admits most easily of self-contained testing.

The vast majority of data problems are not intentionally introduced. They appear by accident or through miscommunication, and because they were not detected and corrected. It isn’t easy to completely assure the quality of a data set, and in any event there frequently aren’t resources allocated to thorough checking. Data can seem quite authoritative when viewed from a distance – why check it?

It would be nice to have at least two individuals (or teams) develop the same data product in parallel, collaborating on a shared set of business rules, so that they can check their final results against one another, and have a further QA step before release. But the model of a lone individual hacking out a spreadsheet without so much as a second look before release is unfortunately common. When it comes to open data initiatives, for example, it sometimes seems that data is scooped from a trough and thrown into the world.

Data dirtiness could be addressed at creation, and we should keep this in mind when creating data ourselves. Keep it clean! But as it is often not, it is incumbent on the analyst to be diligent in checking and addressing issues with the data under analysis. Be aware of the process you are a part of.

phenomenon of interest - data creation - data - analysis - system of beliefs

All of analysis is coming up with beliefs. These beliefs are based on implicit models. Probably the most common cause of initial results from data is error. This is especially true for interesting results, but it applies everywhere. For example:

$ wc -l data.csv 
       5 data.csv

A common reading of this would be that the file data.csv contains five lines, or rows. However, this is not necessarily the case:

> read.table("data.csv")
1 I think...\nthat...\nit's good!
2                           dunno
3                       It's bad!

There are three rows here. What happened? There was a problem with our assumptions, or with the implicit model we used. The program wc does not tell us how many lines there are in a file. It tells us how many newline characters (“\n”) there are – how many times there is a byte that goes 00001010. The implicit assumption was that each line of data had exactly one newline character. It was a linear model: “number of lines in the file = 1 * number of newline characters in the file + 0”. Notice that the model is fine, and the result is also fine – if the assumption holds. But often, we aren’t even aware of the assumption, we aren’t aware of the implicit model, and we don’t check what we’re relying on.
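The gap between the two counts can be reproduced in a few lines. This is a sketch that assumes a file shaped like the example above. The temporary file and its contents are made up here, and read.csv is used because its default quoting treats only double quotes as quote characters.

```r
# Write a one-column CSV whose first field contains embedded newlines.
path <- tempfile(fileext = ".csv")
writeLines(
  c("\"I think...", "that...", "it's good!\"", "dunno", "It's bad!"),
  path
)

# What wc -l counts: newline bytes (0x0a) in the file.
bytes <- readBin(path, what = "raw", n = file.size(path))
n_newlines <- sum(bytes == as.raw(0x0a))

# What we usually care about: parsed records.
records <- read.csv(path, header = FALSE)
n_records <- nrow(records)

cat("newline bytes:", n_newlines, "\n")
cat("records:", n_records, "\n")
```

The two numbers agree only when every record occupies exactly one line, which is the assumption the implicit model relies on.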

The example with wc is about being aware of levels of abstraction and how they interact with your tools so as to not draw incorrect conclusions. The sorts of assumptions more closely related to what we usually think of as clean data issues are things like:

  • this is everything we should include
  • there’s exactly one record per entity
  • the line items add up to the subtotals
  • these are all in the same units
  • these dates are all in order

A good place to start with a new data set is by checking for problems that could not possibly occur. Very often, you will find that they occur anyway. A common example is checking that unique IDs are in fact unique. They frequently are not.

When it comes to checking things, I recommend never checking anything once. When you become aware of something that should be true about your data, write a check that embeds itself forever in your code. Very occasionally there are performance concerns, but almost always correctness is more important. The “theory” here is that the strength of the computer – automating things – should be leveraged to reduce human cognitive load while increasing confidence in results. Include explicit assumption tests in analysis code.
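A few of the assumptions listed above might be written as permanent checks like this. It is a minimal sketch: the orders data frame and its columns are hypothetical stand-ins for whatever data is actually under analysis.

```r
# Hypothetical data: one row per order, with line items and a subtotal.
orders <- data.frame(
  id = c(1L, 2L, 3L),
  date = as.Date(c("2014-01-02", "2014-01-05", "2014-01-09")),
  item_a = c(10, 20, 30),
  item_b = c(1, 2, 3),
  subtotal = c(11, 22, 33)
)

# There is exactly one record per entity.
stopifnot(anyDuplicated(orders$id) == 0)

# The line items add up to the subtotals (allowing for floating point error).
stopifnot(isTRUE(all.equal(orders$item_a + orders$item_b, orders$subtotal)))

# The dates are all in order.
stopifnot(all(diff(orders$date) >= 0))

checks_passed <- TRUE
```

Each stopifnot halts the script with an error as soon as its assumption fails, so the checks cost nothing to leave in place.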

Checks can also be included for things that need not be true – for intermediate results, totals, and so on. Analysis code may have many steps, and it may not be obvious how a change at an early step affects things downstream. Including something as simple as, for example, stopifnot(sum(data$total)==9713) at the bottom of a script will alert you if you introduce something that changes this – especially useful to know when you think you’re making changes that don’t.

Another way to make this point about coded checks is that comments in code should be avoided for describing data.

# There are 49 missing values.     # NO
stopifnot(sum(is.na(data)) == 49)  # YES

The comment is instantly out of date if something changes, and likely forgotten. The code will automatically let you know if something changes. This is vastly superior.

Quite a lot of checks can be written very simply – for example, the often-neglected checks around merges (joins). Many more complex checks are included in the assertive package available for R, which has things like assert_all_are_credit_card_numbers and assert_all_are_isbn_codes, for example.
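For merges, a sketch of the kind of simple checks meant here might look like the following. The two data frames are invented for illustration, and the right checks depend on what kind of join is intended. This version assumes a one-to-one join on id.

```r
# Hypothetical tables that should match one-to-one on id.
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
salaries <- data.frame(id = c(1, 2, 3), salary = c(50, 60, 70))

# Before merging: keys are unique on both sides and cover the same entities.
stopifnot(anyDuplicated(people$id) == 0)
stopifnot(anyDuplicated(salaries$id) == 0)
stopifnot(setequal(people$id, salaries$id))

merged <- merge(people, salaries, by = "id")

# After merging: no rows were silently dropped or duplicated.
stopifnot(nrow(merged) == nrow(people))
```

Without checks like these, a duplicated key multiplies rows and a missing key drops them, and neither produces an error or warning from merge.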

Many checks will be domain-specific, but it’s common to have univariate distributions to investigate. The tails (outliers) are often the result of some data problem, and we should be grateful when this is the case, because it’s much easier to notice than if the spurious data is throughout the range of correct values. Of course the tails are also important places to take care because they will have dramatic effects on many types of analysis.

This is a place where data visualization is an essential tool for data cleaning. Histograms can show some information about a distribution. I often prefer what Cleveland calls quantile plots (different from QQ plots) which can be easily made in R (with index rather than quantile labels) as plot(sort(x)) for a numeric vector x. These plots do not depend on bin size – every data point is shown. Much more of the fine structure of a distribution can be quickly and easily taken in. This image shows a contrived data set two ways, illustrating the occasional advantage of quantile plots.

histogram vs quantile
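A comparison along these lines can be sketched as follows. The data is contrived here and is not the data set behind the image: 30 identical values are hidden inside an otherwise uniform range.

```r
set.seed(1)

# Contrived data: uniform values plus a suspicious pile of exact 5s.
x <- c(runif(200, min = 0, max = 10), rep(5, 30))

par(mfrow = c(1, 2))

# Histogram: the repeated value is blended into one bin.
hist(x, main = "histogram")

# Sorted-value plot: the repeated value shows up as a flat shelf.
plot(sort(x), main = "sorted values", xlab = "index", ylab = "x")
```

In the sorted plot every point is drawn, so a run of identical values appears as a horizontal segment regardless of any binning choice.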

Many data checks might be called “reality checks”. It is worth being aware of the danger of crossing over into “unreality checks” when our biases overrule the data. There is often an essentially Bayesian flavor to checking data. We suspect that a janitor’s salary of $100,000,000 is not correct because of our prior beliefs about the world. Probably we are correct to say that this data is bad. But if we are over-zealous in eliminating or adjusting data that seems wrong, we may end up eliminating or adjusting the most interesting correct data as well. Care must be taken to make choices that maximize the information content of our data.

When possible, fix data. For example, adult heights entered sometimes in feet and sometimes in centimeters can be easily adjusted to either by machine because their ranges do not overlap. This is probably better than dropping all of either type, but choices like this may not always be easy.
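A sketch of that height fix, with made-up values and an assumed cutoff of 10: anything below it is treated as feet, anything above as centimeters.

```r
# Hypothetical adult heights, some entered in feet and some in centimeters.
height <- c(5.5, 170, 6.1, 158, 5.9)

# Assumption: no adult is under 10 cm or over 10 feet, so 10 separates the units.
in_feet <- height < 10
height_cm <- ifelse(in_feet, height * 30.48, height)

# Embed a check that the repaired values are plausible adult heights.
stopifnot(all(height_cm > 100 & height_cm < 250))

print(round(height_cm, 1))
```

The closing stopifnot keeps the repair honest: if a future value falls outside the plausible range, the script fails instead of quietly converting it.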

It’s not uncommon to hear it said that 80% of a data project is data wrangling work. But this doesn’t mean that you can do data cleaning Monday through Thursday and analysis on Friday. It is a good idea to think about data cleaning up front, but there are often practical concerns about the scope of the data cleaning work.

clean all the data!

> ncol(data)
[1] 1274

clean all the data?

In addition, it’s often the case that you discover data problems in the course of analysis. (Sometimes, this is the only way that data problems are discovered.) You likely can’t predict at the outset exactly which fields you’ll need or how problems will arise. A simple example is that two columns may each look fine, but if analysis calls for their ratio, absurd data problems can become evident.
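The ratio case can be illustrated with invented numbers. The columns and the plausibility bounds below are assumptions chosen for the example.

```r
# Hypothetical data: each column looks reasonable when viewed alone.
cities <- data.frame(
  population = c(8400000, 3900000, 2700),
  households = c(3100000, 1400000, 2700000)
)

# The ratio exposes a row where the two columns cannot both be right.
cities$people_per_household <- cities$population / cities$households

# Assumed plausibility bounds for people per household.
suspect <- cities$people_per_household < 1 | cities$people_per_household > 10

print(cities[suspect, ])
```

Here the third row has far more households than people, which no univariate check on either column would reveal.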

Another concern is that data issues are often stubbornly non-general. We would like to aim to generalize, but may find that ad hoc solutions are sometimes unavoidable. We are left aiming to generalize, preparing to specialize.

Probably data cleaning will remain art and science, entangled with analysis, and resistant to fully generalizable principles. As usual:

The difference between theory and practice is larger in practice than it is in theory.

What’s the difference between Bayesian and non-Bayesian statistics?

A coin is flipped and comes up heads five times in a row. Is it a fair coin?

Whether you trust a coin to come up heads 50% of the time depends a good deal on who’s flipping the coin. If you’re flipping your own quarter at home, five heads in a row will almost certainly not lead you to suspect wrongdoing. At a magic show or gambling with a shady character on a street corner, you might quickly doubt the balance of the coin or the flipping mechanism.

What is often meant by non-Bayesian “classical statistics” or “frequentist statistics” is “hypothesis testing”: you state a belief about the world, determine how likely you are to see what you saw if that belief is true, and if what you saw was a very rare thing to see then you say that you don’t believe the original belief. That original belief about the world is often called the “null hypothesis”.

Our null hypothesis for the coin is that it is fair – heads and tails both come up 50% of the time. If that’s true, you get five heads in a row 1 in 32 times. That’s 3.125% of the time, or just 0.03125, and this sort of probability is sometimes called a “p-value”. If the value is very small, the data you observed was not a likely thing to see, and you’ll “reject the null hypothesis”. The cutoff for smallness is often 0.05. So the frequentist statistician says that it’s very unlikely to see five heads in a row if the coin is fair, so we don’t believe it’s a fair coin – whether we’re flipping nickels at the national reserve or betting a stranger at the bar.
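The arithmetic can be checked directly. This sketch computes the probability by hand and also with R's binom.test, using a one-sided alternative, which is an assumption that matches the "five heads in a row" framing.

```r
# Probability of five heads in five flips of a fair coin.
p_by_hand <- 0.5^5

# Same quantity from an exact one-sided binomial test.
p_from_test <- binom.test(5, 5, p = 0.5, alternative = "greater")$p.value

cat("by hand:", p_by_hand, "\n")
cat("binom.test:", p_from_test, "\n")
cat("reject at 0.05:", p_by_hand < 0.05, "\n")
```

A two-sided test, which would also count five tails as equally extreme, gives a larger p-value than this one-sided version.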

Say a trustworthy friend chooses randomly from a bag containing one normal coin and two double-headed coins, and then proceeds to flip the chosen coin five times and tell you the results. When would you be confident that you know which coin your friend chose? If a tails is flipped, then you know for sure it isn’t a coin with two heads, of course. But what if it comes up heads several times in a row? When would you say that you’re confident it’s a coin with two heads?

If you stick to hypothesis testing, this is the same question and the answer is the same: reject the null hypothesis after five heads.

Notice that when you’re flipping a coin you think is probably fair, five flips seems too soon to question the coin. But when you know already that it’s twice as likely that you’re flipping a coin that comes up heads every time, five flips seems like a long time to wait before making a judgement. The non-Bayesian approach somehow ignores what we know about the situation and just gives you a yes or no answer about trusting the null hypothesis, based on a fairly arbitrary cutoff.

The Bayesian approach to such a question starts from what we think we know about the situation. This is called a “prior” or “prior distribution”. In the case of the coins, we understand that there’s a \frac{1}{3} chance we have a normal coin, and a \frac{2}{3} chance it’s a two-headed coin.

The Bayesian next takes into account the data observed and updates the prior beliefs to form a “posterior” distribution that reports probabilities in light of the data. The updating is done via Bayes’ rule, hence the name. In Gelman’s notation, this is:

\displaystyle p(\theta|y) = \frac{p(\theta)p(y|\theta )}{p(y)}

For our example, this is: “the probability that the coin is fair, given we’ve seen some heads, is what we thought the probability of the coin being fair was (the prior) times the probability of seeing those heads if the coin actually is fair, divided by the probability of seeing the heads at all (whether the coin is fair or not)”. This is true.

So say our friend has announced just one flip, which came up heads. Back with the “classical” technique, the probability of that happening if the coin is fair is 50%, so we have no idea if this coin is the fair coin or not. With Bayes’ rule, we get the probability that the coin is fair is \frac{\frac{1}{3} \cdot \frac{1}{2}}{\frac{5}{6}}. (Conveniently, that p(y) in the denominator there, which is often difficult to calculate or otherwise know, can often be ignored since any probability that we calculate this way will have that same denominator.) In our case here, the answer reduces to just \frac{1}{5} or 20%. There’s an 80% chance after seeing just one heads that the coin is a two-headed coin. After four heads in a row, there’s about a 3% chance that we’re dealing with the normal coin. Notice that even with just four flips we already have better numbers than with the alternative approach and five heads in a row. And the Bayesian approach is much more sensible in its interpretation: it gives us a probability that the coin is the fair coin. With the earlier approach, the probability we got was a probability of seeing such results if the coin is a fair coin – quite different and harder to reason about.
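The update can be written as a small function. This is a sketch of Bayes' rule for this specific bag of coins. The function name and its arguments are made up for illustration.

```r
# Posterior probability that the coin is the fair one,
# after seeing n_heads heads in a row and no tails.
posterior_fair <- function(n_heads, prior_fair = 1 / 3) {
  like_fair <- 0.5^n_heads  # p(y | fair coin)
  like_double <- 1          # p(y | two-headed coin)
  p_y <- prior_fair * like_fair + (1 - prior_fair) * like_double
  prior_fair * like_fair / p_y
}

cat("after 1 head: ", posterior_fair(1), "\n")
cat("after 4 heads:", posterior_fair(4), "\n")
cat("after 5 heads:", posterior_fair(5), "\n")
```

With one head the result is 1/5, with four heads it is 1/33 (about 3%), and with five heads it is 1/65.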

It’s tempting at this point to say that non-Bayesian statistics is statistics that doesn’t understand the Monty Hall problem. But of course this example is contrived, and hypothesis testing generally does make it possible to compute a result quickly, with some mathematical sophistication producing elegant structures that can simplify problems – and one is generally only concerned with the null hypothesis anyway, so there’s in some sense only one thing to check. The Bayesian formulation is more concerned with all possible permutations of things, and it can be more difficult to calculate results, as I understand it – especially difficult to come up with closed forms for things. There again, the generality of Bayes does make it easier to extend it to arbitrary problems without introducing a lot of new theory.

The example with the coins is discrete and simple enough that we can actually just list every possibility. In general this is not possible, of course, but here it could be helpful to see and understand that the results we get from Bayes’ rule are correct, verified diagrammatically:

diagram of coin scenarios

Here tails are in grey, heads are in black, and paths of all heads are in bold. You can see, for example, that of the five ways to get heads on the first flip, four of them are with double-heads coins.
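The same counting can be done by listing every coin face that could land up on the first flip. This sketch labels the two double-headed coins double1 and double2, names invented here, and treats each face as equally likely.

```r
# Every face that could come up on the first flip: six equally likely outcomes.
faces <- data.frame(
  coin = c("fair", "fair", "double1", "double1", "double2", "double2"),
  face = c("H", "T", "H", "H", "H", "H")
)

heads <- faces[faces$face == "H", ]
n_heads <- nrow(heads)                     # ways to see heads
n_double <- sum(heads$coin != "fair")      # of those, from a two-headed coin

cat(n_double, "of", n_heads, "heads outcomes come from two-headed coins\n")
cat("probability the coin is fair:", (n_heads - n_double) / n_heads, "\n")
```

Counting outcomes this way gives the same 1/5 that Bayes' rule produces after a single head.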

I’m thinking about Bayesian statistics as I’m reading the newly released third edition of Gelman et al.’s Bayesian Data Analysis, which is perhaps the most beautiful and brilliant book I’ve seen in quite some time. The example here is logically similar to the first example in section 1.4, but that one becomes a real-world application in a way that is interesting and adds detail that could distract from what’s going on – I’m sure it complements nicely the traditional abstract coin-flipping probability example here. I’ll also note that I may have over-simplified the hypothesis testing side of things, especially since the coin-flipping example has no clear idea of what is more extreme (all tails is as unlikely as all heads, etc.), there was no experiment design or reasoning about that side of things, and so on. I think the characterization is largely correct in outline, and I welcome all comments!

Rambling toward a definition of Big Data

I was at Big Data Camp today and it made me think again about what Big Data is. Often this is phrased as “How big is big data?” and there are many answers. Sometimes the size issue is sidestepped to frame Big Data as related to structured vs. unstructured data rather than or in addition to raw size. Contradictory definitions have not quite proliferated, but no single accepted definition has come to dominate.

I offer that Big Data is Big in the literal sense, but as is so often the case it’s a question of “big compared to what?”. There are appropriate points of comparison, but they are not the same for every problem domain, which leads to apparently differing answers as to the size of Big Data. The appropriate point of comparison is the size of data that some existing tool can handle.

Computers have been doing things like word counts for a long time, so if you’re counting words you might not think your data is big until it no longer fits on a single machine. Putting data on multiple machines is a popular class of big data approach. Some people will argue with you about how big a single machine can be, of course, and this approach of getting a really big machine and/or getting new tools that make better use of cores or disks should also be included as Big Data approaches.

Unstructured data can seem big incredibly quickly. This is not a failure of the offered definition here – this is because the existing tool for dealing with unstructured data is often “have a human look at it”. So if you have enough unstructured data that it’s not feasible to have a human read it all (or whatever) I would say that you could fairly call it Big Data even if it’s only a few megabytes. It’s differently big than word counts, certainly.

A common misconception is that doing word counts is a good replacement tool for having humans read essays, for example. If you think this then you begin to conflate the two foregoing examples. Tools for analyzing natural language are so primitive that this type of mistaken thinking is understandable, but it is not correct.

Upon reflection as I write this, I think that the difference in the sense of bigness is significant enough that unstructured data shouldn’t be considered big in the same way that terabytes are. I think it’s probably more appropriate to think of it as two separate axes, one of raw size and one of analysis complexity/difficulty. There’s probably a good two-by-two matrix figure in this.

Thank you, Big Data Camp NYC and Strata, for stimulating this line of thought.

Data Science is Learning from Data

There are a lot of unhelpful definitions of data science. To be a useful term, it needs a sensible meaning. This is what I mean by data science:

Data science is learning from data.

The check for whether you’ve just done data science is a two-part test:

  • Did you use data?
  • Did you learn something?

In some ways then, data science is generalized science – science without a specific field. On the other hand, sometimes a data scientist may not develop the “experiments” that generate the data, and in this case data science corresponds more closely to a subset or specialization of scientific skills.

An irate student of mine once argued that if data science is at all sensibly named, there should be hypotheses being tested. This often doesn’t seem to be the case. I agree that the scientific method is a beautiful thing, but I also think that a lot of good science has been and continues to be observational. The reason for an expedition to Galapagos was never to test a hypothesis on the existence of the Blue-footed Booby, for example. Often there’s much to be learned just in describing data.

If your machine is learning but you’re not, you aren’t doing data science. You may certainly be doing very good data engineering and solving real problems. Algorithms and their development is the field of computer science. Their application happens in the fields of software engineering and data engineering. Deep learning and so on work very nicely and solve real problems, but if there isn’t a finding that humans can understand, it isn’t data science. Data science can certainly use statistics and machine learning, but black box techniques are not generally helpful for human understanding.

It certainly isn’t about the size of the data. There are techniques you need in order to work with big data, but these are just techniques. The scientific method is not obsolete. It is true that more is different, but the way that it’s different is that it’s more. If there’s any change to science, it’s that there’s a backlog of analysis due to the large amount of data. But there again, the data that’s backing up is usually not the kind that comes from proper experiments. We’ll still need experiments.

This definition of a “data scientist” is not far at all from “data analyst”. It may be that the reasons to use the name “data scientist” instead range only from “sexier buzzword” to “distance field from know-nothing low-level and business analysts”.

Business abuses the term data science in two main ways. The first is understandable, since data science requires the use of some computer science and engineering techniques. But data science is not primarily about engineering (i.e., building) products. Data science could be involved, but most of that is engineering work. A more useless view from business folks is well explained by IBM: “[w]hat sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings”. Data scientist should not just be a higher-ranking business title than data analyst, and everyone should be able to communicate – unless what you’re really looking for is something like a data journalist, somehow parallel to the way science journalists communicate about science.

Let us go forth and learn about the world.


Infographics are dead – Long live information graphics!

The word “infographic” has come to connote bad design in combination with comically low or even negative data density.*

A recent blog post contrasts “infographics” with “data visualizations”.** A recent xkcd expresses a similar critical sentiment, directed particularly at “tall infographics” which require scrolling through their colorful information deserts. The WTF Visualizations tumblr collects examples of bad information design, most from what would be called infographics.

The “information graphic” of Bret Victor’s 2006 Magic Ink paper is the elegant product of Tufte’s “information design” and the best way to conceptualize effective “information software”. An information graphic is made “to display a complex set of data in a way that [the viewer] can understand it and reason about it.” “Show the data.” “It is for learning.”

While the pejorative sense of “infographic” is dominant now, let’s remain committed to the ideal of good communication through information graphics.



* Negative data density is achieved when incorrect or confusing communication leads to a net loss in human understanding.

** There are two good points in the qunb post about visual perception and increasing data-ink ratio. This is standard Tufte/Few fare. Most of the post seems to be about defining infographics as made by particular tools, like Photoshop, and specifically as static images. Upon trying out their demo product, it’s clear why. If you don’t attend closely to their complete definition of an infographic, you’ll think that the qunb “data stories” product is… an animated slide-show of infographics.