Scraping GNU Mailman Pipermail Email List Archives

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used Python’s lxml for the first-pass scraping of all the archive file URLs. The process was then made more interesting because most of the monthly archives are gzipped. Instead of saving the gzipped files to disk and then gunzipping them, we used Python’s gzip and StringIO modules to decompress them in memory. The result is the full text history of a specified email list, ready for further processing. Here’s the code we came up with:

#!/usr/bin/env python

import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

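def emails_from_filename(filename):
    # Download one monthly archive and return its contents,
    # gunzipping in memory when the file is compressed.
    print filename
    response = requests.get(url + filename)
    if filename.endswith('.gz'):
        # StringIO makes the downloaded bytes look like a file for gzip.
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents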

contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)
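
The script above is written for Python 2 (the print statement and the StringIO module). As a rough sketch of how the in-memory gunzip step might look on Python 3, here is a version that swaps StringIO for io.BytesIO. The function names decode_archive and join_archives are made up for this illustration, the UTF-8 decoding with replacement is an assumption about the archive encoding, and the sample data is invented so the example runs without touching the network:

```python
import gzip
import io


def decode_archive(filename, raw):
    """Return the text of one monthly archive, gunzipping in memory if needed."""
    if filename.endswith('.gz'):
        # BytesIO wraps the downloaded bytes so GzipFile can read them like a file.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
            raw = gz.read()
    # Assumption: treat the archive as UTF-8 and replace undecodable bytes.
    return raw.decode('utf-8', errors='replace')


def join_archives(named_blobs):
    """Join (filename, bytes) pairs into one text, reversing the listed order."""
    texts = [decode_archive(name, raw) for name, raw in named_blobs]
    texts.reverse()
    return "\n\n\n\n".join(texts)


# Offline demonstration with made-up archive contents.
sample = [
    ('2014-September.txt.gz', gzip.compress(b'From: b@example.org\nsecond\n')),
    ('2014-August.txt', b'From: a@example.org\nfirst\n'),
]
combined = join_archives(sample)
print(combined)
```

In a real run, the bytes would come from requests.get(url + filename).content, as in the original script, and the combined text would be written to listname + '.txt'.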

The Information: a History, a Theory, a Flood

This is a really good book.

James Gleick is excellent. The history is beautifully researched and explained; there is so much content, and it is all fitted together very nicely.

The core topic is information theory, with the formalism of entropy, but perhaps it’s better summarized as the story of human awakening to the idea of what information is and what it means to communicate. It is a new kind of awareness. Maybe the universe is nothing but information! I’m reminded of the time I met Frederick Kantor.

I’m not sure if The Information pointed me to it, but I’ll also mention Information Theory, Inference, and Learning Algorithms by David J.C. MacKay. This book can be read as a PDF for free. I haven’t gone all the way through it, but it seems to be a good, more advanced reference.

The Information: Highly recommended for all!

Dataclysm: There’s another book

Dataclysm is a nicely made book. In the Coda (p. 239) we learn something of why:

Designing the charts and tables in this book, I relied on the work of the statistician and artist Edward R. Tufte. More than relied on, I tried to copy it.

The book is not unpleasant to read, and it goes quickly. It may be successful as a popularization. I rather wish it had more new, interesting results. Perhaps the author agrees with me; often the cheerleading for the potential of data reads like disappointment with the actuality of the results so far.

The author’s voice was occasionally quite insufferable. He describes himself as “photobombing before photobombing was a thing” in a picture with Donald Trump and Mikhail Gorbachev, for example. This anecdote takes up around an eighth of the text in the second chapter, perhaps more. The chapter is about the value of being polarizing, so if he alienated me there it may count as a success.

In conclusion: the OkTrends blog is fun; there’s also a book version now.

Plan Space from Outer Nine

education, data, and the internet

Monthly Archives: September 2014
