Scraping GNU Mailman Pipermail Email List Archives

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used Python‘s lxml for the first pass scraping of all the archive file URLs. The process was then made more interesting by the gzip‘ing of most monthly archives. Instead of saving the gzip’ed files to disk and then gunzip’ing them, we used Python’s gzip and StringIO modules. The result is the full text history of a specified email list, ready for further processing. Here’s the code we came up with:

#!/usr/bin/env python

import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    print filename
    response = requests.get(url + filename)
    if filename[-3:] == '.gz':
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)

5 thoughts on “Scraping GNU Mailman Pipermail Email List Archives

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s