Scraping GNU Mailman Pipermail Email List Archives

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used Python’s lxml for a first pass that scraped all the archive file URLs. The process was then made more interesting by the gzip’ing of most monthly archives: instead of saving the gzip’ed files to disk and then gunzip’ing them, we used Python’s gzip and StringIO modules to decompress them in memory. The result is the full text history of a specified email list, ready for further processing. Here’s the code we came up with:

#!/usr/bin/env python

import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

# Fetch the archive index page and pull every link in the third
# table column, which holds the archive files.
response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    # Download one archive file; most months are gzip'ed, so decompress
    # those in memory instead of writing them to disk first.
    print filename
    response = requests.get(url + filename)
    if filename[-3:] == '.gz':
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

# Download every archive, reverse the order of the months,
# and join everything into one big text file.
contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)

Data done wrong: The only-most-recent data model

It’s not uncommon to encounter a database that only stores the most recent state of things. For example, say the database has one row per Danaus plexippus individual, with a column called stage that tells you whether an individual is currently a caterpillar or a butterfly.

This kind of design might seem fine for some applications, but it gives you no way of seeing what happened in the past. When did that individual become a butterfly? (Conflate, for the moment, the time of the change in the real world and the time the change is made in the database – and say that the change is instantaneous.) Disturbingly often, you find after running such a history-less database for a while that you actually do need to know how the data changed over time – but you haven’t got that information.

There are at least two approaches to this problem. One is to store transactional data. In the plexippus example this could mean storing one row per life event per individual, with the date-time at which the row was recorded. The current known state of each individual can still be extracted (or maintained as a separate table). Another approach is to use a database that tracks all changes; the idea is something like version control for databases, and one implementation with this philosophy is Datomic.
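
A minimal sketch of the transactional idea in Python (the field names and the example individual are invented for illustration): append one row per life event, and derive the current state by replaying the events.

# Event log sketch: one row per life event per individual (invented fields).
events = [
    {'individual_id': 17, 'stage': 'caterpillar', 'recorded_at': '2014-03-20'},
    {'individual_id': 17, 'stage': 'butterfly', 'recorded_at': '2014-04-04'},
]

def current_state(events):
    # Replay events in recorded order; the last stage seen wins.
    state = {}
    for event in sorted(events, key=lambda e: e['recorded_at']):
        state[event['individual_id']] = event['stage']
    return state

print(current_state(events))  # {17: 'butterfly'}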

With a record of transactional data or a database that stores all transactions, you can query back in time: what was the state of the database at such-and-such time in the past? This is much better than our original setup. We don’t forget what happened in the past, and we can reproduce our work later even if the data is added to or changed. Of course this requires that the historical records not be themselves modified – the transaction logs must be immutable.

This is where simple transactional designs on traditional databases fail. If someone erroneously enters on April 4th that an individual became a butterfly on April 3rd, when really the transformation occurred on April 2nd, and this mistake is only realized on April 5th, there has to be a way of adding another transaction to indicate the update – not altering the record entered on April 4th. This can quickly become confusing – it can be a little mind-bending to think about data about dates which changes over time. The update problem is a real headache. I would like to find a good solution to this.
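
One way to sketch a workable update scheme (again with invented field names): record both the date a change supposedly happened in the world and the date the row was entered, and only ever append. The most recently recorded row is the current best knowledge.

# Append-only corrections: never edit the April 4th row, just add another.
log = [
    {'individual_id': 17, 'stage': 'butterfly',
     'happened_on': '2014-04-03', 'recorded_on': '2014-04-04'},
    {'individual_id': 17, 'stage': 'butterfly',
     'happened_on': '2014-04-02', 'recorded_on': '2014-04-05'},  # the correction
]

def best_known(log, individual_id, stage):
    # The most recently recorded row for this individual and event wins.
    rows = [r for r in log
            if r['individual_id'] == individual_id and r['stage'] == stage]
    return max(rows, key=lambda r: r['recorded_on'])

print(best_known(log, 17, 'butterfly')['happened_on'])  # '2014-04-02'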

monarch butterfly in May

NYC Test Data

A series of posts analyzing publicly available New York City Math and English Language Arts (ELA) standardized test results. There are a lot of graphs. Code is on GitHub.

  1. Putting the data together and looking at it
  2. Checking out the number of students tested in Math and ELA
  3. Checking out the number of students tested in Math and ELA again
  4. The total number of students and tests
  5. The total number of students and tests by grade
  6. Considering District 75 schools
  7. The total number of tests by grade viewed by cohort
  8. Number of students tested at the school grade subject level
  9. Normalizing the distributions of average scores
  10. Schools fight the Law of Large Numbers
  11. Changes in average scores for school grades and cohorts
  12. Changes in scores by year – where is the Common Core shake-up?


Some theory and practice for data cleaning

Data cleaning can refer to many things. I’ll mention some important data structure ideas elucidated by Wickham, and then spend more time on the topic of assumptions and data coherence, and how data cleaning happens in practice. By the end, we may achieve some hope of getting reasonably meaningful results from our data.

Hadley Wickham has made available very good explanations (paper, slides, presentation) of what he calls “tidy data”. Tidy data has variables stored in columns, observations in rows, and a single type of experimental unit per dataset. This framework is a good idea in its own right, and it also lets you easily use a lot of very good data manipulation and graphics tooling in R – much of which is also due to Wickham.

The idea of tidy data provides guidance in how to begin restructuring data for analysis. What, then, are the data problems that are not fundamentally due to data structure?

Harry Frankfurt has a charming essay, published as a small book, called On Bullshit.

cover of On Bullshit

The book is quite funny, and it also provides a useful definition of bullshit: communicating without concern for the truth, or with “indifference to how things really are” (p. 34). It claims, and may be correct, that “bullshit is a greater enemy of the truth than lies are” (p. 61).

A lot of data has this bullshit lack of regard for facts, and it is this sort of data problem that I am principally concerned with. This class of problem can arise all throughout the data quality framework of McCallum, which calls for data to be complete, coherent, correct, and accountable, though typically coherence admits most easily of self-contained testing.

The vast majority of data problems are not intentionally introduced. They appear by accident or through miscommunication, and they persist because they are not detected and corrected. It isn’t easy to completely assure the quality of a data set, and in any event there frequently aren’t resources allocated to thorough checking. Data can seem quite authoritative when viewed from a distance – why check it?

It would be nice to have at least two individuals (or teams) develop the same data product in parallel, collaborating on a shared set of business rules, so that they can check their final results against one another, and have a further QA step before release. But the model of a lone individual hacking out a spreadsheet without so much as a second look before release is unfortunately common. When it comes to open data initiatives, for example, it sometimes seems that data is scooped from a trough and thrown into the world.

Data dirtiness could be addressed at creation, and we should keep this in mind when creating data ourselves. Keep it clean! But as it is often not, it is incumbent on the analyst to be diligent in checking and addressing issues with the data under analysis. Be aware of the process you are a part of.

phenomenon of interest → data creation → data → analysis → system of beliefs

All of analysis is coming up with beliefs. These beliefs are based on implicit models. Probably the most common cause of initial results from data is error. This is especially true for interesting results, but it applies everywhere. For example:

$ wc -l data.csv 
       5 data.csv

A common reading of this would be that the file data.csv contains five lines, or rows. However, this is not necessarily the case:

> read.table("data.csv")
                               V1
1 I think...\nthat...\nit's good!
2                           dunno
3                       It's bad!

There are three rows here. What happened? There was a problem with our assumptions, or with the implicit model we used. The program wc does not tell us how many lines there are in a file. It tells us how many newline characters (“\n”) there are – how many times there is a byte that goes 00001010. The implicit assumption was that each line of data had exactly one newline character. It was a linear model: “number of lines in the file = 1 * number of newline characters in the file + 0”. Notice that the model is fine, and the result is also fine – if the assumption holds. But often, we aren’t even aware of the assumption, we aren’t aware of the implicit model, and we don’t check what we’re relying on.
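
To make the gap between the two models concrete, here is a rough Python sketch that counts newline bytes the way wc does and then counts parsed records; for a CSV with newlines inside quoted fields (like data.csv above), the two counts need not agree.

import csv

# Count newline bytes, as wc -l effectively does.
with open('data.csv', 'rb') as f:
    newline_count = f.read().count(b'\n')

# Count parsed records, letting the CSV parser handle quoted newlines.
with open('data.csv', newline='') as f:
    record_count = sum(1 for _ in csv.reader(f))

print(newline_count, record_count)  # these can differ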

The example with wc is about being aware of levels of abstraction and how they interact with your tools so as to not draw incorrect conclusions. The sorts of assumptions more closely related to what we usually think of as clean data issues are things like:

  • this is everything we should include
  • there’s exactly one record per entity
  • the line items add up to the subtotals
  • these are all in the same units
  • these dates are all in order

A good place to start with a new data set is by checking for problems that could not possibly occur. Very often, you will find that they occur anyway. A common example is checking unique IDs. They are frequently not.

When it comes to checking things, I recommend never checking anything once. When you become aware of something that should be true about your data, write a check that embeds itself forever in your code. Very occasionally there are performance concerns, but almost always correctness is more important. The “theory” here is that the strength of the computer – automating things – should be leveraged to reduce human cognitive load while increasing confidence in results. Include explicit assumption tests in analysis code.
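
For instance, assumption tests along these lines (a Python sketch using pandas, with an invented file name and invented columns) can sit near the top of an analysis script and run on every execution:

import pandas as pd

data = pd.read_csv('my_data.csv', parse_dates=['date'])

# There's exactly one record per entity: IDs really are unique.
assert data['id'].is_unique, 'duplicate IDs'

# These dates are all in order.
assert data['date'].is_monotonic_increasing, 'dates out of order'

# These are all in the same units: heights should look like centimeters.
assert data['height'].between(50, 250).all(), 'heights outside plausible cm range'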

Checks can also be included for things that need not be true – for intermediate results, totals, and so on. Analysis code may have many steps, and it may not be obvious how a change at an early step affects things downstream. Including something as simple as, for example, stopifnot(sum(data$total)==9713) at the bottom of a script will alert you if you introduce something that changes this total – which is especially useful to know when you think you’re making changes that shouldn’t affect it.

Another way to make this point about coded checks is that comments in code should be avoided for describing data.

# There are 49 missing values.     # NO
stopifnot(sum(is.na(my.data))==49) # YES

The comment is instantly out of date if something changes, and likely forgotten. The code will automatically let you know if something changes. This is vastly superior.

Quite a lot of checks can be written very simply – for example, the often-neglected checks around merges (joins). Many more complex checks are included in the assertive package available for R, which has things like assert_all_are_credit_card_numbers and assert_all_are_isbn_codes, for example.
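
As a rough sketch of what a checked merge can look like in Python with pandas (the frames and keys here are invented), the validate and indicator arguments to merge do much of the work:

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
right = pd.DataFrame({'id': [1, 2, 2, 3], 'y': ['a', 'b', 'c', 'd']})

# validate raises if the keys don't satisfy the stated relationship;
# indicator adds a _merge column recording where each row came from.
merged = left.merge(right, on='id', how='left',
                    validate='one_to_many', indicator=True)

assert not (merged['_merge'] == 'left_only').any(), 'unmatched left rows'
assert len(merged) >= len(left)  # rows multiplied only where expected
print(merged['_merge'].value_counts())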

Many checks will be domain-specific, but it’s common to have univariate distributions to investigate. The tails (outliers) are often the result of some data problem, and we should be grateful when this is the case, because it’s much easier to notice than if the spurious data is throughout the range of correct values. Of course the tails are also important places to take care because they will have dramatic effects on many types of analysis.

This is a place where data visualization is an essential tool for data cleaning. Histograms can show some information about a distribution. I often prefer what Cleveland calls quantile plots (different from QQ plots) which can be easily made in R (with index rather than quantile labels) as plot(sort(my.data)). These plots do not depend on bin size – every data point is shown. Much more of the fine structure of a distribution can be quickly and easily taken in. This image shows a contrived data set two ways, illustrating the occasional advantage of quantile plots.

histogram vs quantile
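
Roughly the same index-versus-sorted-value plot can be made in Python with matplotlib (a sketch with an invented example vector):

import matplotlib.pyplot as plt

values = [2, 3, 3, 4, 4, 4, 5, 5, 9, 40]  # invented data with one odd tail value
plt.plot(sorted(values), marker='.', linestyle='none')
plt.xlabel('index')
plt.ylabel('value')
plt.show()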

Many data checks might be called “reality checks”. It is worth being aware of the danger of crossing over into “unreality checks” when our biases overrule the data. There is often an essentially Bayesian flavor to checking data. We suspect that a janitor’s salary of $100,000,000 is not correct because of our prior beliefs about the world. Probably we are correct to say that this data is bad. But if we are over-zealous in eliminating or adjusting data that seems wrong, we may end up eliminating or adjusting the most interesting correct data as well. Care must be taken to make choices that maximize the information content of our data.

When possible, fix data. For example, adult heights entered sometimes in feet and sometimes in centimeters can be easily adjusted to either by machine because their ranges do not overlap. This is probably better than dropping all of either type, but choices like this may not always be easy.
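
A sketch of that kind of fix in Python: because no plausible adult height in feet overlaps the centimeter range, a simple threshold (10 here, an illustrative choice) separates the two encodings.

FEET_TO_CM = 30.48

def height_in_cm(height):
    # Anything under the threshold must have been entered in feet.
    if height < 10:
        return height * FEET_TO_CM
    return height

heights = [5.5, 168, 6.1, 180]             # mixed units, as entered
print([height_in_cm(h) for h in heights])  # all in centimeters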

It’s not uncommon to hear it said that 80% of a data project is data wrangling work. But this doesn’t mean that you can do data cleaning Monday through Thursday and analysis on Friday. It is a good idea to think about data cleaning up front, but there are often practical concerns about the scope of the data cleaning work.

clean all the data!

> ncol(my.data)
[1] 1274

clean all the data?

In addition, it’s often the case that you discover data problems in the course of analysis. (Sometimes, this is the only way that data problems are discovered.) You likely can’t predict at the outset exactly which fields you’ll need or how problems will arise. A simple example is that two columns may each look fine, but if analysis calls for their ratio, absurd data problems can become evident.
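
A tiny illustration with invented numbers: each column looks plausible on its own, but taking the ratio exposes a row where the part exceeds the total.

totals = [765, 812, 9, 1040]
parts = [47, 55, 61, 52]   # 61 looks fine until you see its total of 9

for total, part in zip(totals, parts):
    if part / total > 1:
        print('suspicious row:', part, 'out of', total)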

Another concern is that data issues are often stubbornly non-general. We would like to aim to generalize, but may find that ad hoc solutions are sometimes unavoidable. We are left aiming to generalize, preparing to specialize.

Probably data cleaning will remain art and science, entangled with analysis, and resistant to fully generalizable principles. As usual:

The difference between theory and practice is larger in practice than it is in theory.

Bad Data Handbook: Quite Good

Preparing for an upcoming talk, I thought I’d read this book:

bad data handbook cover

(You can find this book online for free – is that legit?)

I like this book quite a lot. It’s a collection of chapters by different authors, and reads something like a series of excellent blog posts. With the exception of chapter 18, it’s quite good. It covers a lot of the issues that arise in practice when gathering and starting to work with data. The explanation of text encoding in chapter 4 could be the best I’ve seen, and chapter 14 (“myths of cloud computing”) is something I wish a lot of people who present themselves as “cloud experts” would read and understand. Philipp K. Janert, author of Data Analysis with Open Source Tools, contributes a very nice chapter as well.

The book closes with a “framework” for data quality, with these “four Cs”:

  • Complete
  • Coherent
  • Correct
  • aCcountable

It’s not bad, this book. I’d recommend it to anyone who needs to work with data in the real world. I think there’s room for even more theory and practice of data cleaning; I’d like to see an even better book yet!

Excel Two Thousand and Wat

There’s no perfect tool. Things get done well when you know your tools inside and out. Because I enjoy complaining about Excel, and because I enjoy Gary Bernhardt’s “wat” talk about surprising behaviors in JavaScript and Ruby, I offer here “Excel Two Thousand and Wat”, combining some personal pet peeves and a few other Excel oddities. (I’m using Excel 2011 for Mac; your mileage may vary.)

Let’s talk about Excel.

Say we have a nice UTF-8 CSV file:

$ wc test.csv
       8       8      56 test.csv
$ cat test.csv
id,value
001,1
010,2
01a,3
100,4
101,하하
110,5
111,6
$ file test.csv
test.csv: UTF-8 Unicode text

It has eight lines: a header and seven rows of data. It has two columns: an ID column and a value column. The IDs are all three characters long. The values have some unicode: Korean characters for “ha ha”. So of course Excel will have no problem opening this.

excel window

wat

Excel has kindly munched the leading zeros from our ID column. Why would we want the data that was in the file we opened, after all? And of course, it is totally appropriate for modern software to not understand UTF-8 encoded text.

Well, we can fix this! We can type the Korean into Excel by hand.

excel window

And we can fix the chomped IDs by typing them in with a leading apostrophe. Obviously.

excel window

Excel is so helpful, second-guessing the data type of every cell. This is a great feature! For example, Excel will render “3+5” as “3+5” but show “=3+5” as “8” – it uses that equal sign to know if it’s supposed to evaluate an equation. Great! So something like “++++3”, which is obviously not an equation, will be fine!

excel window

wat

Excel took the entry of “++++3” and then: a) turned it into an equation, and b) ate one of the pluses. If you re-evaluate the cell (with Ctrl-u, Ctrl-Enter, for instance) Excel will keep eating pluses.

But enough confusion! Let’s talk about Excel.

Since we’ve fixed the errors introduced by Excel, let’s go ahead and save as a new CSV file, tested.csv. Nothing will have changed, of course, with the worksheet that we still have open.

excel window

Oh good – the Korean we typed in still appears in the sheet, but in the formula bar now it’s two empty boxes.

wat

Things will look fine in the file though, certainly.

$ wc tested.csv 
       0       8      51 tested.csv
$ cat tested.csv 
111,6$ e

Now our CSV has zero lines, and confuses the heck out of bash. Good. What if we re-open the file in Excel?

excel window

The leading zeros we took pains to put back in are cleaned out again, and the Korean has been helpfully simplified to two underscores.

But enough about failures to get the basics right. Let’s talk about Excel.

Excel famously does a great job of handling missing values. If you type in “TRUE” it’s the boolean truth value, and “FALSE” is false. We can check with an “if” statement in an equation. What should an empty cell be?

excel window

An empty cell is false, naturally. We can check it with “and” in an equation as well. “True and false” is false, as we all know.

excel window

Ah – “True and blank” is true, so… blank is also true. Got it?

wat

That’s enough! (Of course there are also my gripes with how Excel’s PERCENTRANK works.) Have fun, and use your tools well!

Use counts, not percentages

Consider this data:

total    part   percent
  765      47        6%

Clearly, there is some redundancy: given the total, part and percent express the same thing.

With infinite precision, you could use either part or percent at your pleasure. However, in the common case where the counts (total and part) are integers and the percentages are not, computers store the integers exactly and compactly, unlike nasty decimal things (floats or string representations).

Percentages also commonly get rounded off, in which case information is lost. In the above example, 6% of 765 could be anything from 43 to 49, and possibly even more depending on what precision is used for the calculation.
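
That range is easy to recover by brute force; here is a short Python sketch using the numbers from the example:

# Which integer parts of 765 round to a reported 6%?
total = 765
candidates = [part for part in range(total + 1)
              if round(100.0 * part / total) == 6]
print(min(candidates), max(candidates))  # 43 49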

The moral of the story is that for data, you should always use counts, not percentages.