Peeps Be Askin’ Me

Peeps be all the time askin’ me, what are the excellent tech/data things to do in DC? Where are the cool people to be found? What’s good?

Well look:

For data talks and socializing, get with the meetups listed on Data Community DC’s “Speaking Events” page.

For hanging out and hacking on projects that help the world be more awesome, you know you have to be down with Code for DC.

And of course there’s Hack and Tell, but you already know.

Scraping GNU Mailman Pipermail Email List Archives

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used Python's lxml for the first-pass scraping of all the archive file URLs. The process was then made more interesting by the gzip'ing of most monthly archives. Instead of saving the gzip'ed files to disk and then gunzip'ing them, we used Python's gzip and StringIO modules to decompress them in memory. The result is the full text history of a specified email list, ready for further processing. Here's the code we came up with:

#!/usr/bin/env python
# Python 2 script: download the full text history of a Pipermail email list archive.

import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

# The archive index page has one table row per month; the third column
# holds the link to the downloadable (usually gzip'ed) text archive.
response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    # Fetch one monthly archive, decompressing it in memory if it's gzip'ed.
    print filename
    response = requests.get(url + filename)
    if filename[-3:] == '.gz':
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

contents = [emails_from_filename(filename) for filename in filenames]
# The index lists the newest month first, so reverse for chronological order.
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)

The Information: a History, a Theory, a Flood

This is a really good book.

[cover image: The Information]

James Gleick is excellent. The history is beautifully researched and explained; there is so much content, and it is all fitted together very nicely.

The core topic is information theory, with the formalism of entropy, but perhaps it’s better summarized as the story of human awakening to the idea of what information is and what it means to communicate. It is a new kind of awareness. Maybe the universe is nothing but information! I’m reminded of the time I met Frederick Kantor.

I’m not sure if The Information pointed me to it, but I’ll also mention Information Theory, Inference, and Learning Algorithms by David J.C. MacKay. The book can be read as a free PDF. I haven’t gone all the way through it, but it seems to be a good, more advanced reference.

The Information: Highly recommended for all!

Dataclysm: There’s another book

[cover image: Dataclysm]

Dataclysm is a nicely made book. In the Coda (p. 239) we learn something of why:

Designing the charts and tables in this book, I relied on the work of the statistician and artist Edward R. Tufte. More than relied on, I tried to copy it.

The book is not unpleasant to read, and it goes quickly. It may be successful as a popularization. I rather wish it had more new, interesting results. Perhaps the author agrees with me; often the cheerleading for the potential of data reads like disappointment with the actuality of the results so far.

The author’s voice was occasionally quite insufferable. He describes himself “photobombing before photobombing was a thing” in a picture with Donald Trump and Mikhail Gorbachev, for example. This anecdote takes up around an eighth of the text in the second chapter, perhaps more. The chapter is about the value of being polarizing, so if he alienated me there it may count as a success.

In conclusion: the OkTrends blog is fun; there’s also a book version now.

How To Sort By Average Rating

Evan Miller’s well-known How Not To Sort By Average Rating points out problems with ranking by “wrong solution #1” (by differences, upvotes minus downvotes) and “wrong solution #2” (by average ratings, upvotes divided by total votes). Miller’s “correct solution” is to use the lower bound of a Wilson score confidence interval for a Bernoulli parameter. I think it would probably be better to use Laplace smoothing, because:

  • Laplace smoothing is much easier
  • Laplace smoothing is not always negatively biased

This is the Wilson scoring formula given in Miller’s post, which we’ll use to get 95% confidence interval lower bounds:

\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + \frac{z_{\alpha/2}^2}{n}}

(Use minus where it says plus/minus to calculate the lower bound.) Here \hat{p} is the observed fraction of positive ratings, z_{\alpha/2} is the (1-\alpha/2) quantile of the standard normal distribution, and n is the total number of ratings.

Now here’s the formula for doing Laplace smoothing instead:

(upvotes + \alpha) / (total votes + \beta)

Here \alpha and \beta are parameters that represent our estimation of what rating is probably appropriate if we know nothing else (cf. Bayesian prior). For example, \alpha = 1 and \beta = 2 means that a post with no votes gets treated as a 0.5.

The Laplace smoothing method is much simpler to calculate – there’s no need for statistical libraries, or even square roots!
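To make the comparison concrete, here’s a minimal Python sketch of both scores (the function names wilson_lower_bound and laplace_score are mine, not Miller’s; z = 1.96 corresponds to the 95% interval). Plugging in the vote counts from the tables below reproduces the Wilson and Laplace columns, up to rounding:

from math import sqrt

def wilson_lower_bound(upvotes, downvotes, z=1.96):
    # Lower bound of the Wilson score interval; z = 1.96 gives 95% confidence.
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    phat = float(upvotes) / n
    return ((phat + z * z / (2 * n)
             - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
            / (1 + z * z / n))

def laplace_score(upvotes, downvotes, alpha=1.0, beta=2.0):
    # Laplace smoothing: alpha=1, beta=2 treats an item with no votes as a 0.5.
    return (upvotes + alpha) / float(upvotes + downvotes + beta)

print(wilson_lower_bound(209, 50))  # ~0.7545
print(laplace_score(209, 50))       # ~0.80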

Does it successfully solve the problems of “wrong solution #1” and “wrong solution #2”? First, the problem with “wrong solution #1”, which we might summarize as “the problem with large sample sizes”:

              upvotes  downvotes  wrong #1  wrong #2  Wilson  Laplace
first item        209         50       159      0.81  0.7545     0.80
second item       118         25        93      0.83  0.7546     0.82

All the methods except “wrong solution #1” agree that the second item should rank higher.

Then there’s the problem with “wrong solution #2”, which we might summarize as “the problem with small sample sizes”:

              upvotes  downvotes  wrong #1  wrong #2  Wilson  Laplace
first item          1          0         1      1.0     0.21     0.67
second item       534         46       488      0.92    0.90     0.92

All the methods except “wrong solution #2” agree that the second item should rank higher.

How similar are the results for the Wilson method and the Laplace method overall? Take a look: here color encodes the score, with blue at 0.0, white at 0.5, and red at 1.0:

[plot of Wilson and Laplace methods]

They’re so similar, you might say, that you would need a very good reason to justify the complexity of the calculation for the Wilson bound. But also, the differences favor the Laplace method! The Wilson method, because it’s a lower bound, is negatively biased everywhere. It’s certainly not symmetrical. Let’s zoom in:

[plot of Wilson and Laplace methods - zoomed]

With the Wilson method, you could have three upvotes, no downvotes and still rank lower than an item that is disliked by 50% of people over the long run. That seems strange.

The Laplace method does have its own biases. By choosing \alpha=1 and \beta=2, the bias is toward 0.5, which I think is reasonable for a ranking problem like this. But you could change it: \alpha=0 with \beta=1 biases toward zero, \alpha=1 with \beta=0 biases toward one. And \alpha=100 with \beta=200 biases toward 0.5 very strongly. With the Wilson method you can tune the size of the interval, adjusting the confidence level, but this only adjusts how strongly you’re biased toward zero.
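To see those biases in numbers, here’s what the laplace_score and wilson_lower_bound sketches from above give (again, the function names and defaults are mine):

print(laplace_score(0, 0, alpha=1, beta=2))      # 0.5: with no votes, the score is just alpha/beta
print(laplace_score(0, 0, alpha=0, beta=1))      # 0.0: this prior pulls toward zero
print(laplace_score(3, 0, alpha=1, beta=2))      # 0.8: three upvotes, no downvotes
print(laplace_score(3, 0, alpha=100, beta=200))  # ~0.507: a strong prior barely moves from 0.5
print(wilson_lower_bound(3, 0))                  # ~0.44: the three-upvote case discussed above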

Here’s another way of looking at the comparison. How do the two methods compare for varying numbers of upvotes with a constant number (10) of downvotes?

[plot: Wilson and Laplace methods again]

Those are similar curves. Not identical – but is there a difference to justify the complexity of the Wilson score?

In conclusion: Just adding a little bit to your numerators and denominators (Laplace smoothing) gives you a scoring system that is as good or better than calculating Wilson scores.

[code for this post]

Genocide Data

I recently became interested in preventing genocide with data. I think this is not an easy thing to do. I undertook to identify data sources that might be relevant, and thanks to many wonderful people, I can present the following results!

#1. Karen Payne’s “GIS Data Repositories”, “Conflict” section.

Karen has assembled a phenomenal collection of references to datasets, curated within a publicly accessible Google spreadsheet. I’m sure many of the other things I’ll mention are also included in her excellent collection!

This list of data repositories was compiled by Karen Payne of the University of Georgia’s Information Technologies Outreach services, with funding provided by USAID, to point to free downloadable primary geographic datasets that may be useful in international humanitarian response. The repositories are grouped according to the tabs at the bottom

#2. The Global Database of Events, Language, and Tone (GDELT)

Kalev Leetaru of Georgetown University is super helpful and runs this neat data munging effort. There is a lot of data available. The GDELT Event Database uses CAMEO codes; in this scheme, there is code “203: Engage in ethnic cleansing”. There’s also the Global Knowledge Graph (GKG), which may be better for identifying genocide, because one can identify “Material Conflict events that are connected to the genocide theme in the GKG.”

GDELT is now listed on data.gov in coordination with the World Wide Human Geography Data working group.

Jay Yonamine did some work using GDELT to forecast violence in Afghanistan.

#3. The Humanitarian Data Exchange

This new project seems very promising – Javier Teran was very helpful in describing what’s currently available: “datasets on refugees, asylum seekers and other people of concern in our HDX repository that may be useful for your research”. By the time you read this, there may be even more genocide-related data!

#4. Uppsala conflict database

The Uppsala Conflict Data Program (UCDP) offers a number of datasets on organised violence and peacemaking, all of which can be downloaded for free

#5. USHMM / Crisis in Darfur

Max writes:

the National Holocaust Museum has done quite a bit about collecting and visualizing this kind of data. In particular, a few years back they led a large mapping project around Darfur

#6. AAAS Geospatial Technology and Human Rights topic

The American Association for the Advancement of Science has a collection of research related to Geospatial Technology and Human Rights. Start reading!

#7. Amnesty International

I haven’t looked into what data they might have and make available, but it seems like a relevant organization.

#8. Tech Challenge for Atrocity Prevention

USAID and Humanity United ran a group of competitions in 2013 broadly around fighting atrocities against civilians. You can read about it via PR Newswire and Fast Company. I found the modeling challenge particularly interesting – it was hosted by TopCoder, as I understand it, and the winners came up with some interesting approaches for predicting atrocities with existing data.

#9. elva.org

This is a tip I haven’t followed up on, but it could be good:

Hi, I would reach out to Jonne Catshoek of elva.org, they have an awesome platform and body of work that is really unappreciated. They also have a very unique working relationship with the nation of Georgia that could serve as a model for other work.

#10. The CrisisMappers community

“The humanitarian technology network” – this group is full of experts in the field, organizes the International Conference of Crisis Mappers, and has an active and helpful Google Group. The group is closed membership but welcoming; connecting there is how I found many of the resources here. Thanks CrisisMappers!

#11. CrisisNET

The illustrious Chris Albon introduced me to CrisisNET, “the firehose of global crisis data”:

CrisisNET finds, formats and exposes crisis data in a simple, intuitive structure that’s accessible anywhere. Now developers, journalists and analysts can skip the days of tedious data processing and get to work in minutes with only a few lines of code.

Examples of what you can do with it:

Tutorials and documentation on how to do things with it:

#12. PITF

All I know is that PITF could be some sort of relevant dataset; I haven’t had time to investigate.

 

This document

I’ll post this on my blog, where it’s easy to leave comments with additions, corrections, and so on without knowing git/github, but the “official” version of this document will live on github and any updates will be made there. Document license is Creative Commons Share-Alike, let’s say.

 

More thanks:

  • Thanks of course to everyone who helped with the resources they’re involved in providing and curating – I tried to give this kind of credit as much as possible above!
  • Special thanks to Sandra Moscoso and Johannes Kiess of the World Bank for providing pointers to number 2 and more!
  • Special thanks to Max Richman of GeoPoll for providing numbers 4, 5, 6, and 7.
  • Special thanks to Minhchau “MC” Dinh of USAID for extra help with number 8!
  • Number 9 was provided via email; all I have is an email address, and I thought people probably wouldn’t want their email addresses listed. Thanks, person!
  • Special thanks to Patrick Meier of iRevolution for connecting me first to number 10!

Data done wrong: The only-most-recent data model

It’s not uncommon to encounter a database that only stores the most recent state of things. For example, say the database has one row per Danaus plexippus individual. The database could have a column called stage which tells you whether an individual is currently a caterpillar or a butterfly.

This kind of design might seem fine for some applications, but you have no way of seeing what happened in the past. When did that individual become a butterfly? (Conflate, for the moment, the time of the change in the real world and the time the change is made in the database – and say that the change is instantaneous.) Disturbingly often, you find after running a timeless database for a while that you actually do need to know how the database changed over time – but you haven’t got that information.

There are at least two approaches to this problem. One is to store transactional data. In the plexippus example this could mean storing one row per life event per individual, with a timestamp for when the row was entered. The current known state of each individual can still be extracted (or maintained as a separate table). Another approach is to use a database that tracks all changes; the idea is something like version control for databases, and one implementation with this philosophy is Datomic.

With a record of transactional data or a database that stores all transactions, you can query back in time: what was the state of the database at such-and-such time in the past? This is much better than our original setup. We don’t forget what happened in the past, and we can reproduce our work later even if the data is added to or changed. Of course this requires that the historical records not themselves be modified – the transaction logs must be immutable.

This is where simple transactional designs on traditional databases fail. If someone erroneously enters on April 4th that an individual became a butterfly on April 3rd, when really the transformation occurred on April 2nd, and this mistake is only realized on April 5th, there has to be a way of adding another transaction to indicate the update – not altering the record entered on April 4th. This can quickly become confusing – it can be a little mind-bending to think about data about dates that itself changes over time. The update problem is a real headache. I would like to find a good solution to this.
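Here’s a minimal sketch of the append-only idea (the table and column names, and the year in the dates, are made up for illustration): each row records both the real-world date of the change and the date the row was entered, so a correction is a new row and you can still ask what the database claimed on any given day.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE stage_events (
        individual_id INTEGER,
        stage         TEXT,  -- 'caterpillar' or 'butterfly'
        effective     TEXT,  -- when the change happened in the world
        recorded      TEXT   -- when this row was entered; never updated
    )
""")

# April 4th: the transformation is recorded, with the (wrong) date of April 3rd.
conn.execute("INSERT INTO stage_events VALUES (1, 'butterfly', '2015-04-03', '2015-04-04')")
# April 5th: the mistake is noticed; a correcting row is added, the old row untouched.
conn.execute("INSERT INTO stage_events VALUES (1, 'butterfly', '2015-04-02', '2015-04-05')")

# What did the database claim as of April 4th? The latest row recorded by then.
print(conn.execute("""
    SELECT stage, effective FROM stage_events
    WHERE individual_id = 1 AND recorded <= '2015-04-04'
    ORDER BY recorded DESC LIMIT 1
""").fetchone())  # butterfly, effective April 3rd: the belief at the time, error included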

[photo: monarch butterfly in May]