Genocide Data

I recently became interested in preventing genocide with data. I think this is not an easy thing to do. I undertook to identify data sources that might be relevant, and thanks to many wonderful people, I can present the following results!

#1. Karen Payne’s “GIS Data Repositories”, “Conflict” section.

Karen has assembled a phenomenal collection of references to datasets, curated within a publicly accessible Google spreadsheet. I’m sure many of the other things I’ll mention are also included in her excellent collection!

This list of data repositories was compiled by Karen Payne of the University of Georgia’s Information Technologies Outreach services, with funding provided by USAID, to point to free downloadable primary geographic datasets that may be useful in international humanitarian response. The repositories are grouped according to the tabs at the bottom.

#2. The Global Database of Events, Language, and Tone (GDELT)

Kalev Leetaru of Georgetown University is super helpful and runs this neat data munging effort. There is a lot of data available. The GDELT Event Database uses CAMEO codes; in this scheme, there is a code “203: Engage in ethnic cleansing”. There’s also the Global Knowledge Graph (GKG), which may be better for identifying genocide, because one can identify “Material Conflict events that are connected to the genocode theme in the GKG.”

GDELT is now listed on in coordination with the World Wide Human Geography Data working group.

Jay Yonamine did some work using GDELT to forecast violence in Afghanistan.
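As a rough illustration of what working with the CAMEO code mentioned above might look like, here is a minimal Python sketch that pulls out rows carrying code 203 from tab-delimited event records. The sample rows are made up, and the position of the event-code column is an assumption passed in as a parameter; check the GDELT documentation for the real file layout before relying on it.

```python
import csv
import io

# Assumption: tab-delimited event records where one column holds the CAMEO
# event code. The column position is passed in rather than hard-coded,
# because it depends on the file layout you actually download.
ETHNIC_CLEANSING = "203"  # CAMEO "Engage in ethnic cleansing", as cited above


def filter_by_event_code(lines, code_column, code=ETHNIC_CLEANSING):
    """Yield rows (lists of strings) whose event-code column equals `code`."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > code_column and row[code_column] == code:
            yield row


# Made-up sample rows (id, date, event code), not real GDELT records.
sample = io.StringIO(
    "1\t20140101\t190\n"
    "2\t20140102\t203\n"
    "3\t20140103\t043\n"
)
matches = list(filter_by_event_code(sample, code_column=2))
print(matches)
```

Codes are compared as strings on purpose, so that codes with leading zeros are not mangled by integer conversion.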

#3. The Humanitarian Data Exchange

This new project seems very promising – Javier Teran was very helpful in describing what’s currently available: “datasets on refugees, asylum seekers and other people of concern in our HDX repository that may be useful for your research”. By the time you read this, there may be even more genocide-related data!

#4. Uppsala conflict database

The Uppsala Conflict Data Program (UCDP) offers a number of datasets on organised violence and peacemaking, all of which can be downloaded for free.

#5. USHMM / Crisis in Darfur

Max writes:

the National Holocaust Museum has done quite a bit about collecting and visualizing this kind of data. In particular, a few years back they led a large mapping project around Darfur

#6. AAAS Geospatial Technology and Human rights topic

The American Association for the Advancement of Science has a collection of research related to Geospatial Technology and Human rights. Start reading!

#7. Amnesty International

I haven’t looked into what data they might have and make available, but it seems like a relevant organization.

#8. Tech Challenge for Atrocity Prevention

USAID and Humanity United ran a group of competitions in 2013 broadly around fighting atrocities against civilians. You can read about it via PR Newswire and Fast Company. I found the modeling challenge particularly interesting – it was hosted by TopCoder, as I understand it, and the winners came up with some interesting approaches for predicting atrocities with existing data.


This is a tip I haven’t followed up on, but it could be good:

Hi, I would reach out to Jonne Catshoek of, they have an awesome platform and body of work that is really unappreciated. They also have a very unique working relationship with the nation of Georgia that could serve as a model for other work.

#10. The CrisisMappers community

“The humanitarian technology network” – this group is full of experts in the field, organizes the International Conference of Crisis Mappers, and has an active and helpful Google Group. The group is closed membership but welcoming; connecting there is how I found many of the resources here. Thanks CrisisMappers!

#11. CrisisNET

The illustrious Chris Albon introduced me to CrisisNET, “the firehose of global crisis data”:

CrisisNET finds, formats and exposes crisis data in a simple, intuitive structure that’s accessible anywhere. Now developers, journalists and analysts can skip the days of tedious data processing and get to work in minutes with only a few lines of code.

Examples of what you can do with it:

Tutorials and documentation on how to do things with it:

#12. PITF

All I know is that PITF could be some sort of relevant dataset; I haven’t had time to investigate.


This document

I’ll post this on my blog, where it’s easy to leave comments with additions, corrections, and so on without knowing git/github, but the “official” version of this document will live on github and any updates will be made there. Document license is Creative Commons Share-Alike, let’s say.


More thanks:

  • Thanks of course to everyone who provided help with the resources they’re involved with providing and curating – I tried to give this kind of credit as much as possible above!
  • Special thanks to Sandra Moscoso and Johannes Kiess of the World Bank for providing pointers to number 2 and more!
  • Special thanks to Max Richman of GeoPoll for providing numbers 4, 5, 6, and 7.
  • Special thanks to Minhchau “MC” Dinh of USAID for extra help with number 8!
  • Number 9 was provided via email; all I have is an email address, and I thought people probably wouldn’t want their email addresses listed. Thanks, person!
  • Special thanks to Patrick Meier of iRevolution for connecting me first to number 10!

Wrap-up from DC Hack and Tell #4

I’ve been putting together these wrap-ups from Hack and Tell in DC for a while now. They go out to the meetup list and they’re archived on github, but I like them so much I thought I’d put them here too. Working through the back-catalog:

DC Hack and Tell

Round 4: The Christmas Invasion

Time to wrap up the Christmas Invasion and put a bow on it… Here are all the good things we saw, in non-random order!

  • Aaron talked about lots of graphs made from NYC test scores.
  • Rick showed a really neat Medicare visualization that he made, which started as a National Day of Civic Hacking project. (Cool!)
  • Julian demoed the next big programming language, MyCoolLang aka Lebowski, rich with Python and LLVM goodness.
  • Chris fought the good fight against lightswitches, automating his home via his lightbulbs’ port 80 (duh).
  • In addition to inventing languages, Julian also improves existing ones – he showed how he became a Python core dev and improved performance (timing).
  • “So your friendly neighborhood bikeshare station is out of bikes again. What are the odds?” CHRIS WILL SHOW YOU THE ODDS.
  • And Joseph showed some of the magic of saltvagrant, and of course salty vagrant.

Happy solstice, everybody! See you on January 13, 2014!

Wrap-up from DC Hack and Tell #3

I’ve been putting together these wrap-ups from Hack and Tell in DC for a while now. They go out to the meetup list and they’re archived on github, but I like them so much I thought I’d put them here too. Working through the back-catalog:

DC Hack and Tell

Round 3: Hack… to the Future!

And now, a wrap-up… in random order!

  • Mike showed the excellent audioverb for all your language in situ needs – and it even has a youtube explanation too!
  • Aaron talked about rjstat, his R package for reading and writing the JSON-stat data format.
  • Fearless leader Jonathan shared a classic Hack and Tell hack for decoding cryptograms using simple language models and SIMULATED ANNEALING! (I know, right?) It’s called cryptosolver. We miss you, Jonathan!
  • Bayan showed how to simulate fantasy football drafts/seasons in R to test theories and impress your friends! With a Prezi!
  • Tom presented not just the JS live-coding but also super fun interactive statistics and simple-statistics!
  • Aaron also showed this Guess the Letter thing. Oh my gosh there’s a blog post.

And there will be even more good stuff coming soon… to the future!

Unnatural Causes

Unnatural Causes is “a seven-part documentary series exploring racial & socioeconomic inequalities in health” from 2008. In a horrible irony, the episodes are not available to the public. The cheapest way to see it is to pay $24.95 for streaming. There doesn’t seem to be an option for buying the DVD from the main site unless you are an organization. I believe this is a mistake. The apparent goals of the producers would be better served by making the complete materials publicly available at no cost. You can watch some clips on their YouTube channel, which is good, but why not release everything, together with information on actions to take or links to further information? I don’t even remember where I heard about the series, and it wasn’t particularly easy to track down viewing options. The audience would be so much bigger if energy were devoted to spreading the videos rather than locking them up.

As I have now been lucky enough to see the complete series, here is a brief summary of the episodes:

1. “In sickness and in wealth”: The Whitehall Study is introduced. The Whitehall Study, which is frequently referenced throughout the series, found that health is associated with wealth, not just in a binary way, but in gradations all along the levels of wealth. The importance of a sense of control and a corresponding stress of social subordination are pointed to as people at varying levels of health and wealth are introduced in an American city. Also apparently there was some experiment that gave everybody colds by putting virus right into their noses – is that seriously an experimental technique that people use?

2. “When the bough breaks”: The stress of institutional and persistent racism is identified as a determinant of health. The example of low birth weights for babies born to black mothers is given. Also I noticed that the series is dedicated to the memory of Judy Crichton.

3. “Becoming American”: It is noted that Latino immigrants to America are initially healthier than other Americans, and tight families are given as a potential explanation. Also the Pennsylvania town that hosts the examples has some community center, and a youth center, which seem nice. Then it’s brought to light that immigrant health is much worse after five years, and also there’s some mention of mental illness.

4. “Bad sugar”: A community of Native Americans is the example of the episode, relevant because of very high levels of diabetes. The stress of being displaced by US forces, not dealt fairly with and essentially forced to eat a radically different and inferior diet, as well as the attendant problems of poverty, all contribute.

5. “Place matters”: Biggest takeaway was learning about the original redlining, which gave good home loans almost exclusively to white people from around 1934 to 1962. Grrr. The episode then talks about how bad neighborhoods are stressful; violence, mold, asthma, all suck. Everything is health policy. There’s also a pointing to the failure of private developers to provide what is really needed for people.

6. “Collateral damage”: This episode centers on the Marshall Islands, where US military involvement no longer sends showers of nuclear fall-out, but a base still dominates the economy to ill effect. Overcrowding on the adjacent island, which is essentially a slum compared to the island of the US base, leads to tuberculosis and other ailments. The people of the Marshall Islands can leave their homes and move to the US (Arkansas is a popular destination, it seems) but health problems can continue there.

7. “Not just a paycheck”: Electrolux is a Swedish company that moved one factory from Michigan to Mexico and another from Sweden to Hungary. In Michigan this ruined a lot of lives, while in Sweden it was a comparatively small problem. Americans are less well protected by their government and their unions than the Swedes are by theirs, and the Americans have worse health outcomes. The American setting also illustrates increasing inequality as a family laid off from the factory lives on an old family farm that is increasingly surrounded by huge second homes of the rich.

This post was made possible through the generous support of the B. R. Schumacher Foundation.

Here Comes Everybody

Harlan mentioned this book so I read it.


It came out back in 2008 and was a lot more timely then, I imagine.

There are lots of interesting tidbits in here. It’s largely anecdote-based, and it uses the word “suasion” twice. Here are some quotes:

… large social systems cannot be understood as a simple aggregation of the behaviors of some nonexistent “average” user.

… it’s easier to like people who are odd in the same ways you are odd, but it’s harder to find them.

… trying something is often cheaper than making a formal decision about whether to try it.

… the question “Do the people who like it take care of each other?” turns out to be a better predictor of success than “What’s the business model?”

Shirky also brings up the Bill Joy quote, “No matter who you are, most of the smart people work for someone else.” This made me wonder whether Google agrees, these days.

I like reciprocal altruism a lot: “With reciprocal altruism, favors are exchanged without formal bookkeeping …” (emphasis mine). This is my preferred way of doing things. The problem seems to be the number of people and anonymity online, and so there are systems with formal bookkeeping like eBay’s buyer/seller rating system, or points on StackOverflow. Is this the direction that everything is moving in? If we end up with zero privacy/anonymity online, will that solve the problem of freeloaders and other bad behavior?

Things I hadn’t previously heard of: asmallworld (gross), Dodgeball (people are still doing this stuff). Also Richard Gabriel’s Worse Is Better talk (increasingly it seems LISP people have all the ideas).

Maybe the most interesting bit from the book was this forward-looking claim:

So here’s a hypothesis about the near future, based on little more than a hunch and some tantalizing examples: we’re about to experience a revolution in collective action, and the driver of that revolution will be new legal structures that will support productive collective action.

I don’t know if that has happened, or if it is happening. Shirky pointed out that intellectual property was the main collective product at the time of his writing – things like Linux and Wikipedia, where licenses like the GPL protect the product. The only things I can think of that are beyond software and writing are products that get kickstarted, for example, and I don’t know if that counts. Restricting to financial structures seems unfortunate. But crowd-funding and anonymous currencies like Bitcoin might be the closest thing to steps in this direction, as far as I can see. Meetup was in the book, and doesn’t have any special legal structures for organizations as far as I know. What else am I missing?

Quizz Quotes

I was exploring Google Papers the other day and came across Quizz: Targeted Crowdsourcing with a Billion (Potential) Users by Ipeirotis and Gabrilovich. Downside: occasionally reads like a Google ad. Upside: really interesting results from an experimental Q&A system which is still live. It’s very cool. Here are some quotes with my commentary:

… the strong self-selection of high-quality users to continue contributing, while low-quality users self-select to drop out.

… there is little incentive for unpaid users to continue participating when there is no monetary reward and they are not good at the task.

The goal of the system was not educational, so they celebrate the fact that it isn’t fun if you suck.

These results indicate that users may be more interested in learning about the topic rather than just knowing whether they answered correctly.

The results included that people answer more questions when the interface shows the correct answer as “feedback” rather than just showing “correct” or “incorrect.” This section of experimental results was particularly interesting, including commentary on possible failures of leaderboards.

… as more and more users participate, the achievements of the top users are difficult to match, effectively discouraging users from trying harder.

They did say that a leaderboard including only the last week’s worth of results was more effective.

I’m less interested in the application of this kind of system for crowd-sourcing information, more interested in educational applications, but there is some clear overlap, and cited papers such as The multidimensional wisdom of crowds seem very interesting. Also through Ipeirotis’ blog I found out about Smarterer, which is interesting as well. There’s some sort of spectrum, or multi-dimensional thing going on, with education, crowdsourcing, and evaluation all in the mix.

The authors’ application of information gain and a Markov Decision Process are also interesting.

Writing to think: Questions on the web

I have made some things online that involve “asking and answering questions” in the traditional multiple-choice-test way. I built the software to do that (with Python on Google App Engine, again differently with node.js on Heroku) both times.

Is there any “built in” web element for questions and answers of the types I’m thinking of? There are HTML forms. HTML forms provide a fair amount of flexibility, and even start to have some functionality for different question structures – radio buttons for a single choice vs. checkboxes for multiple selections. But HTML forms, being just HTML, have pretty clear limits. Javascript can add some more functionality, and then eventually you need a web server backend of some kind to support more.

There are web services like Google Forms and SurveyMonkey, and the very task-specific Doodle, which take all of HTML/Javascript/backend and run it all for you. This means that the available functionality is whatever they provide, everything is hosted by them, and as far as I know there is little or no mechanism for creating things outside of their web GUIs.

The popular services just mentioned mostly collect information without any feedback; when you want to have a “correct” answer there isn’t much functionality. Where is a good existing solution? There’s internet detritus like There’s Quizlet, which seems pretty neat but also isolated perhaps by its attempt to chase education spending. (It also supports, like most education sites, an unhealthy distinction between student and teacher.)

The desire for profit seems to poison projects that could otherwise have a broader positive effect. Projects affiliated with the very cool JiTT methodology disappeared into companies. I’m not even sure what sort of thinking led to the closing of the Khan Academy source.

But it isn’t just the profit motive that keeps question-and-answer technology balkanized; there’s no real standard, and I don’t think it’s very easy to come up with one. The systems I built aren’t easily transferred anywhere for use by others, for example. This is my fault, but I also don’t think it’s a very easy thing to design.

There are some attempts at standards for questions, at least. BlackBoard has a way to load questions from some tab-delimited formats. Moodle has something called GIFT. There’s the Question and Test Interoperability spec, which is such a huge mess you need to employ a stapler guy to support it. And there’s something called QUOX. Oh my.

And these are all purely for assessment, where earlier there were some purely for survey/data collection. It seems to me that they shouldn’t be so different. Fundamentally isn’t it all just questions?

Another take on this, I suppose, is sites like Stack Overflow, which represent a different sort of questioning. And there is OSQA, “the Open Source Q&A system”, which is cool. You could run that on your server, or for that matter run Moodle, or some survey platform, most likely. So that’s also another delivery model: the run-your-own-server-with-pre-built-software model. A lot of setup/maintenance overhead, and still not a lot of interoperability as far as I can tell. (OSQA is also available hosted.)

Just one more: There are also frameworks for building assessments, which try to generalize while still providing some structure. I was happy to find out about the one linked, for Rails; I don’t know if there are others or if any are widely used.

Markdown is pretty much the best thing ever. (Note to self: get off wordpress…) Can we come up with a markdown solution to the question problem? Something super light-weight, that blends easily into text files that humans would actually write…

The kramdown (etc.) markdown extension for definition lists seems like a candidate. Here’s how it works:

This is the "term".
: This is the "definition".

This gets rendered something like this, using the standard HTML definition list tags:

This is the “term”.
This is the “definition”.

So let’s say the term is the question, and the (possibly many) definitions are answer choices. Of course we could have a blank definition represent a text box (or text area):

What do you think?
:

A multiple-choice survey could be as easy as this then:

What's your favorite color?
: red
: blue
: green

To add correctness functionality, a little more syntax could be added:

Sugar is sweet.
: true*
: false

The idea here is that these text files would be rendered into interactive HTML/Javascript such that you wouldn’t see which was the correct answer – you would select an answer, possibly have a submit button of some kind, and get feedback on whether your answer agreed with the one in the text. I do think that teacherly paranoia about “test security” is one thing that prevents good functionality from spreading much on the web. Nobody wants to share their oh-so-secret correct answers, lest the horrible children cheat. I think this perspective is a disease on society.
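To make the idea a bit more concrete, here is a minimal Python sketch of a parser for this kind of text, along with a tiny answer checker. It is only one possible reading of the syntax sketched above: the function names and the dictionary layout are made up for illustration, and the trailing `*` is treated as the "correct" marker purely by the convention proposed in this post.

```python
def parse_questions(text):
    """Parse definition-list style questions into a list of dicts.

    A non-blank line that does not start with ':' begins a new question;
    each following ': ...' line is an answer choice. A trailing '*' marks
    the choice as correct (the convention proposed in this post).
    """
    questions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            if not questions:
                continue  # stray choice with no question; ignore it
            choice = line[1:].strip()
            correct = choice.endswith("*")
            if correct:
                choice = choice[:-1].rstrip()
            questions[-1]["choices"].append({"text": choice, "correct": correct})
        else:
            questions.append({"prompt": line, "choices": []})
    return questions


def check(question, selected):
    """Return True/False for a selection, or None if no choice is marked correct."""
    correct = [c["text"] for c in question["choices"] if c["correct"]]
    if not correct:
        return None  # survey-style question: there is no right answer
    return selected in correct


source = """\
What's your favorite color?
: red
: blue
: green

Sugar is sweet.
: true*
: false
"""
survey, quiz = parse_questions(source)
print(check(quiz, "true"), check(quiz, "false"), check(survey, "red"))
```

Returning `None` for questions with no marked answer is one way to let survey questions and quiz questions share the same format, which is sort of the point.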

Maybe this could be a short answer question:

What is the capital city of Wisconsin?

Of course you have the problems of evaluating text answers (Is “Madison, WI” also correct? etc.). Generally, there is of course an awful lot of functionality that you want from questions, and it may be hard to reduce it all down. Some things should be obvious: true and false is a special case of multiple choice. But other things like scoring, when/whether to show the correct answer, etc. seem difficult to abstract very far.

The text questions could be rendered as stand-alone HTML/Javascript, or to connect with (or even be hosted on) some sort of web system. More details would have to be worked out.
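As a sketch of the stand-alone rendering step, the following Python turns one question into a plain HTML fieldset: radio buttons for a single choice, checkboxes for multiple selections, and a text input when there are no choices at all. The function name, its parameters, and the markup layout are all assumptions for illustration; checking answers and giving feedback would still need Javascript or a backend on top of this.

```python
from html import escape


def render_question(prompt, choices, name, multiple=False):
    """Render one question as an HTML fieldset.

    No choices -> a free-text input; otherwise radio buttons (single
    choice) or checkboxes (multiple selections).
    """
    parts = ["<fieldset>", f"  <legend>{escape(prompt)}</legend>"]
    if not choices:
        parts.append(f'  <input type="text" name="{escape(name)}">')
    else:
        kind = "checkbox" if multiple else "radio"
        for choice in choices:
            value = escape(choice, quote=True)
            parts.append(
                f'  <label><input type="{kind}" name="{escape(name)}" '
                f'value="{value}"> {escape(choice)}</label>'
            )
    parts.append("</fieldset>")
    return "\n".join(parts)


markup = render_question("Sugar is sweet.", ["true", "false"], name="q1")
print(markup)
```

Note that the correct answer is deliberately not part of the rendered markup here; where it should live (in the page, or on a server) is one of the details that would have to be worked out.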

The illustrious Ramnath, who always seems to be doing cool things several years before I know about them, has thought about this markdown question idea to some degree. I want to find out more about what he’s done.