I want an edit button on everything

My blog does not have an edit button for people who are not me. This means it takes a bunch of work to fix even a typo: you’d have to tell me about it, somehow describing where the typo is, and then I’d have to go find the spot and make the change. In practice, this pretty much doesn’t happen.

Wikipedia has edit buttons on everything, and so does github. I’m not entirely sure what is best between allow-all-edits-immediately and require-review-for-all-edits. Some mix is also possible, I guess. Wikipedia has ways to lock down articles, and must have corresponding permission systems for who can do/undo that. Github lets you give other people full edit permissions, so you can spread the editor pool at least. Git by itself can support even more fine-grained control, I believe.

I’d like to move my blog to something git-backed, like github pages. It’s a little work, but you can put an edit button on the rendered HTML views shown by github pages too. Advanced R has a beautiful “Edit this page” button on every page. Three.js has one in their documentation. Eric points out Ben’s blog as well, and also the trouble with comments.

Ideally I’d prefer not to be github-bound, I guess, or bound to some comment service. But I also kind of prefer to have everything text-based, so what do you do for comments then? And also I’d like to be able to do R markdown (etc.) and have that all render automagically. But also something serverless. I’m drawn to this javascript-only static site generator, but that also seems to be A Bad Idea.

So: that solves that.

Action and Prioritization, Advertising and Intervention

Amazon can easily show you a product of their choice while you’re on their site. This is their action. Since it’s so easy to show you things, it makes sense to work a lot on choosing carefully what to show. This is their prioritization. Refer to this class of ranking problem as the advertising type.

It is fairly difficult to send food aid to a village (action) or to support and improve a challenged school (action). A deficiency of both knowledge and resources motivates a need to choose where to give attention (prioritization). Refer to this type of ranking problem as the intervention type.

Advertising problems are essentially scattershot, and we only care about whether we hit something, anything. All you need is one good display, perhaps, and you make a sale. These prioritization choices also happen incredibly frequently; they demand automation and preclude individual human evaluation. Individual misses don’t matter because you only need to succeed on average.

Intervention problems, on the other hand, have a million challenges. Even a perfect solution to the prioritization problem does not guarantee success. Careful action will be required of multiple people after a prioritization is complete, and meaningful reasons for prioritization choices will be helpful, likely required. It is inhumane to think of “success on average” for these problems.

These two types of problems are different and demand different approaches. However, the focus and success of “advertisers” on the prioritization part of their work is influencing how problems of both types are approached.

We’re spending too much effort on prioritization, sometimes even mistaking prioritization for action. Why?

1. Prioritization as a way to maximize impact. Certainly it’s good to maximize impact, but it also reflects an abhorrent reality: that we lack the capability to impact everywhere that has need. While it’s good to direct aid to the neediest villages, the need to do that prioritization is a sign that we are choosing to leave other needy villages without aid. We should not forget that we are not solving the problem but only addressing a part of it.

2. Selfish prioritization. Realizing a lack of resources (good schools, good housing, etc.) we wish to identify the best for ourselves. This can appear in guises that sound well-intentioned, but it is fundamentally about some people winning in a zero-sum game while others lose.

3. Prioritization because we don’t know how to take action. This is dangerous because we could let prioritization become our only hope while no resources are directed to solving the problem. While information can drive action eventually, there are lots of problems for which the only thing that will help is a solution (action), not just articulating and re-articulating the problem (prioritization).

I think we need to work more on actions. We need to develop solutions that do not perpetuate a zero-sum game but that improve conditions for everyone. We still need prioritization, but we should be aware of how it fits into solving problems. Important actions are hard to figure out.

A micro-intro to ggmap

This describes what we did in the break-out session I facilitated for the illustrious Max Richman’s Open Mapping workshop at Open Data Day DC. For more detail, I recommend the original paper on ggmap.

ggmap is an R package that does two main things to make our lives easier:

  • It wraps a number of APIs (chiefly the Google Maps API) to conveniently facilitate geocoding and raster map access in R.
  • It operates together with ggplot2, another R package, which means all the power and convenience of the Grammar of Graphics is available for maps.

To install ggmap in R:

install.packages("ggmap")

Then you can load the package.

library(ggmap)

## Loading required package: ggplot2

One thing that ggmap offers is easy geocoding with the geocode function. Here we get the latitude and longitude of The World Bank:

address <- "1818 H St NW, Washington, DC 20433"
(addressll <- geocode(address))

##      lon  lat
## 1 -77.04 38.9

The ggmap package makes it easy to get quick maps with the qmap function. There are a number of options available from various sources:

# A raster map from Google
qmap("Washington, DC", zoom = 13)

[Figure: Google raster map of Washington, DC]

# An artistic map from Stamen
qmap("Washington, DC", zoom = 13, source = "stamen",
     maptype = "watercolor")

[Figure: Stamen watercolor map of Washington, DC]

Since we were at The World Bank, here’s a quick map showing where we were. This shows for the first time how ggplot2 functions (geom_point here) work with ggmap.

bankmap <- qmap(address, zoom = 16, source = "stamen",
                maptype = "toner")
bankmap + geom_point(data = addressll,
                     aes(x = lon, y = lat),
                     color = "red",
                     size = 10)

[Figure: Stamen toner map with a red dot at The World Bank]

To connect with Max’s demo, we can load in his data about cities in Ghana.

ghana_cities <- read.csv("ghana_city_pop.csv")

We’ll pull in a Google map of Ghana and then put dots for the cities, sized based on estimated 2013 population.

ghanamap <- qmap("Ghana", zoom = 7)
ghanamap + geom_point(data = ghana_cities,
  aes(x = longitude, y = latitude,
      size = Estimates2013), color = "red") +
  theme(legend.position = "none")

[Figure: map of Ghana with city dots sized by estimated 2013 population]

Another useful feature to note is the gglocator function, which lets you click on a map and get the latitude and longitude of where you clicked.

gglocator()

This is all the tip of the iceberg. You’ll probably want to know more about ggplot2 if you’re going to make extensive use of ggmap. rMaps is another (and totally different) great way to do maps in R.

This document is also available on RPubs.

A shared playground on Ubuntu

After setting up a machine, I’d like to set up a bunch of users who can log in and give them a common space in which to do some work. The goal is convenience for demonstration and education.

Assume usernames are in a file called names.txt, one per line. This will create users with those names, put them in the users group, and make their passwords “none”. As root:

cat names.txt | while read -r line
 do
  adduser --gecos "" --disabled-password "$line"
  adduser "$line" users
  echo "$line:none" | chpasswd
 done

Now those users should really log in and change their passwords with passwd. Up next, we make a shared directory that everybody has access to.

mkdir /home/shared
chgrp users /home/shared
chmod g+w /home/shared
chmod g+s /home/shared

That makes the directory, sets the group to users, gives group members write access, and sets the setgid bit so that files created in the directory will inherit the users group.
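
If you want to see the setgid behavior before touching /home, here’s a minimal sketch in a throwaway directory (no root needed; the exact group names you’ll see depend on your system):

```shell
# Demonstrate the setgid bit in a temporary directory
dir=$(mktemp -d)
chmod g+s "$dir"
mkdir "$dir/sub"
# An "s" (or "S") in the group permission slot shows the bit is set;
# new entries created inside inherit the directory's group
ls -ld "$dir" "$dir/sub"
rm -rf "$dir"
```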

Clean data with R

This is the content of the talk I did for the 2014-01-08 meetup of Data Wranglers DC. I agree with Tufte on PowerPoint, so I wrote out most of what I wanted to say as a couple blog posts.

The slides are mostly goofy pictures in front of which to talk about the above. The first slide is blank.

The API the FEC Uses

[I wrote this for Gelman’s 365 Days in Statistical Lives project, but the Money in Politics Data Fest is tomorrow, and there is some small chance that this might be useful for someone. Excuse the flowery details. In one sentence: The FEC doesn’t officially have a public API, but they do have a publicly accessible API that you can figure out. Scroll down to code font to see an example.]

I work at NYU. My job title is Senior Data Services Specialist. Data Services is a collaboration between Division of Libraries and Information Technology Services, and what we do is support members of the university community, largely graduate students, in everything from “Where can I get some data?” to “How do I get this last graph to look right?”

So we get an email from a PhD student in politics:

I would like to download data on campaign spending from the website of the Federal Election Commission, http://www.fec.gov/finance/disclosure/candcmte_info.shtml. I would like to download information on about 500 candidates for the House of Representatives, i.e. ca. 500 files. The files are under the category “Report Summaries” and should cover the “two-year period” labeled by “2010”. The files that I am looking for do not seem to be accessible through a single download. At the moment, the only way that I can download the information is therefore to enter the name of the candidate, e.g. “Akin” and download the information individually which takes a long time. Would you be able to help me to speed up this process, perhaps by writing a script which would download the files based on a list of candidates?

My first thought, of course, is that he must be wrong. But I go and look, and sure enough there doesn’t seem to be a combined file for quite what he wants. What the heck, FEC?

Generally, our librarians can help people find existing data sets, and our specialists can help people use software for their analyses. Web scraping is not really part of anybody’s job description, but… maybe we do it now? I like the internet, so I decide to give it a try.

Ideally, I can do better than browser-equivalent scraping, if I can figure out what’s really going on; the site gives you a CSV file after you click around for a while. Viewing source is useless, naturally. Using the Inspector in Chrome I can dig through a miserable hierarchy and eventually see that I need to know what the JavaScript function “exportCandCmteDetailCurrentSummary()” is doing. I look through a couple JavaScript files that I can see are getting pulled in. I look and I look but I can’t find a definition for this function.

I don’t even like JavaScript, but I know there’s an HTTP request in there somewhere that I can emulate without the horrible web site, so I decide to download Wireshark and just sniff it out. I should have started with this, because it’s super easy to find, and before long I can get the files we want, by candidate ID, using cURL.

curl -X POST -F "electionYr=2010" -F "candidateCommitteeId=H0MO02148" -F "format=csv" -F "electionYr0pt=2010" http://www.fec.gov/fecviewer/ExportCandidateCommitteeCurrentReport.do

Stick that in a little shell loop, and you can download all the files you want.
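
Concretely, assuming a file of candidate IDs called ids.txt, one per line (the filename and second ID are my invention here), the loop could look like this:

```shell
# Hypothetical: ids.txt holds one FEC candidate ID per line
cat ids.txt | while read -r id
 do
  curl -X POST -F "electionYr=2010" \
       -F "candidateCommitteeId=$id" \
       -F "format=csv" -F "electionYr0pt=2010" \
       -o "$id.csv" \
       http://www.fec.gov/fecviewer/ExportCandidateCommitteeCurrentReport.do
 done
```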

Time to meet with the patron. Of course he uses Windows, and after we wait for Cygwin to install, cURL is mysteriously unable to find a shared object file. Fine. We can use a lab machine, which does work. We get all the files downloading.

Then we have to figure out how to combine all the files together. The role of Data Services is really to help people do things themselves, not just do things for people completely. Teach a man to fish, Program or Be Programmed… The requester has worked a little bit in Python, with some help, to do similar things in the past, so that’s a possibility. It looks like he has Stata running on his laptop, and if he wants to use that I may have to hand him off to one of our Stata experts. But it turns out that he used R quite a lot while getting his Masters, so maybe we can use that.

Luckily for everyone, he’s reasonably competent with R and we’re able to start writing a script together that will loop through all the candidate IDs that he’s interested in and grab the data he needs from all the files. Of course the files aren’t 100% consistent in their format and contents, so there are some issues to work out. I get to teach him what grep is.

At this point, it looks like he’s going to be able to finish cleaning up his data and move on to his real focus – some sort of analysis that may shine light on the nature of politics in America. He may be the first person ever to do any sort of analysis with these particular numbers. At this point I can feel good about going to lunch.

1,236 multiple-choice MCAS math items

The items are all available on github; this post duplicates the repo’s README:

A friend and I went through all the MCAS math exam PDFs available as of January 2013 and used the Mac screenshot function to create PNG images for every multiple-choice question that didn’t explicitly require some additional resource, like a ruler. We recorded the correct answers in simple text files. This produced 1,236 items from 53 exams offered from 2007 to 2012 for grades 3, 4, 5, 6, 7, 8, and 10.

I plan to use these items in a project I’m working on, but I’d like to make them available in this format for others as well, so that the work we did is not needlessly duplicated. Thanks to the sensibleness of the Massachusetts Department of Elementary and Secondary Education, re-use is allowed. Their complete copyright message is copied at the bottom of this README.

Each exam administration has its own folder. The folders are named like s2008g10math.

  • The first character indicates when the exam is from. The spring administration (s) is when almost all exams are offered. There are also March (m) and November (n) re-tests for tenth grade.
  • The next four characters indicate the year of the exam.
  • The g is a handy delimiter meaning “grade” and the digits that follow it represent the grade level of the exam.
  • The last four characters are always math because that’s all we looked at for this.
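
The naming is mechanical enough to parse in shell; here’s a sketch, where the character offsets assume exactly this format:

```shell
# Pull the pieces out of a folder name like "s2008g10math"
name=s2008g10math
season=$(echo "$name" | cut -c1)      # s, m, or n
year=$(echo "$name" | cut -c2-5)      # four-digit year
grade=${name#*g}                      # drop everything through the "g" delimiter
grade=${grade%math}                   # drop the trailing "math"
echo "$season $year $grade"           # s 2008 10
```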

Within each exam administration folder are a bunch of PNG files and a key.txt text file. The PNG filenames include a date and time that orders them correctly. Be careful – the correct ordering is not always produced by alphabetizing the filenames. The text file key.txt includes one answer per line (A, B, C, or D) and sometimes comments starting with a hash. Comments like “#calc” were meant to indicate the start of the exam section that allowed calculators. These are mostly correct, but I gave up caring about the calculator rules, so I’m not that concerned with whether they’re right. In my further processing I just completely ignore comments.
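
Since comments can simply be ignored, pulling out just the answers is a one-liner; here’s a sketch against a made-up key.txt in the documented format:

```shell
# Build a toy key.txt, then strip comment lines to leave just the answers
printf 'A\nB\n#calc\nC\nD\n' > key.txt
grep -v '^#' key.txt    # prints A, B, C, D on separate lines
```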

I am not in any way affiliated with Massachusetts or its education system. I just hope that by making these items available I might in some small way contribute to improving education overall. Use these items well!

~ Aaron Schumacher


Massachusetts Department of Elementary and Secondary Education copyright notice:

Permission is hereby granted to copy for non-commercial educational purposes any or all parts of this document with the exception of English Language Arts passages that are not designated as in the public domain. Permission to copy all other passages must be obtained from the copyright holder. Please credit the “Massachusetts Department of Elementary and Secondary Education.”