Peeps Be Askin’ Me

Peeps be all the time askin’ me, what are the excellent tech/data things to do in DC? Where are the cool people to be found? What’s good?

Well look:

For data talks and socializing, get with the meetups listed on Data Community DC’s “Speaking Events” page.

For hanging out and hacking on projects that help the world be more awesome, you know you have to be down with Code for DC.

And of course there’s Hack and Tell, but you already know.

Scraping GNU Mailman Pipermail Email List Archives

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used the Python libraries requests and lxml for the first-pass scraping of all the archive file URLs. The process was then made more interesting by the fact that most of the monthly archives are gzip’ed. Instead of saving the gzip’ed files to disk and then gunzip’ing them, we decompressed them in memory with Python’s gzip and StringIO modules. The result is the full text history of a specified email list, ready for further processing. Here’s the code we came up with:

#!/usr/bin/env python

import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = '' + listname + '/'

response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    # Download one archive file and return its text,
    # gunzip'ing in memory when the file is compressed.
    print filename
    response = requests.get(url + filename)
    if filename[-3:] == '.gz':
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

contents = [emails_from_filename(filename) for filename in filenames]

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)
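
The script above targets Python 2 (note the print statement and the StringIO module). For anyone on Python 3, the in-memory gunzip step might look roughly like the sketch below. The helper name decode_archive is made up for illustration, and decoding as UTF-8 with replacement is an assumption about the archives, not something we checked.

```python
import gzip
import io


def decode_archive(filename, payload):
    """Return the text of one downloaded archive file.

    `payload` is the raw bytes of the HTTP response body. Files whose
    names end in '.gz' are decompressed in memory first.
    """
    if filename.endswith('.gz'):
        # Wrap the bytes in a file-like object so gzip can read them
        # without anything being written to disk.
        with gzip.GzipFile(fileobj=io.BytesIO(payload)) as archive:
            payload = archive.read()
    # Assumption: treat the text as UTF-8 and replace undecodable bytes
    # instead of failing on them.
    return payload.decode('utf-8', errors='replace')


# Stand-in for a downloaded monthly archive.
sample = gzip.compress(b'From: someone\nSubject: hello\n')
print(decode_archive('2013-May.txt.gz', sample))
```

With requests, the payload would be response.content, the same attribute the original script uses.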

Writing to think: Questions on the web

I have made some things online that involve “asking and answering questions” in the traditional multiple-choice-test way. I built the software to do that (with Python on Google App Engine, again differently with node.js on Heroku) both times.

Is there any “built in” web element for questions and answers of the types I’m thinking of? There are HTML forms. HTML forms provide quite a bit of flexibility, and even start to have some functionality for different question structures – radio buttons for a single choice vs. checkboxes for multiple selections. But HTML forms, being just HTML, have pretty clear limits. Javascript can add some more functionality, and then eventually you need a web server backend of some kind to support more.

There are web services like Google Forms and SurveyMonkey, and the very task-specific Doodle, which take all of HTML/Javascript/backend and run it all for you. This means that the available functionality is whatever they provide, everything is hosted by them, and as far as I know there is little or no mechanism for creating things outside of their web GUIs.

The popular services just mentioned mostly collect information without any feedback; when you want to have a “correct” answer there isn’t much functionality. Where is a good existing solution? There’s plenty of internet detritus. There’s Quizlet, which seems pretty neat but also isolated, perhaps by its attempt to chase education spending. (It also supports, like most education sites, an unhealthy distinction between student and teacher.)

The desire for profit seems to poison projects that could otherwise have a broader positive effect. Projects affiliated with the very cool JiTT methodology disappeared into companies. I’m not even sure what sort of thinking led to the closing of the Khan Academy source.

But it isn’t just the profit motive that keeps question-and-answer technology balkanized; there’s no real standard, and I don’t think it’s very easy to come up with one. The systems I built aren’t easily transferred anywhere for use by others, for example. This is my fault, but I also don’t think it’s a very easy thing to design.

There are some attempts at standards for questions, at least. BlackBoard has a way to load questions from some tab-delimited formats. Moodle has something called GIFT. There’s the Question and Test Interoperability spec, which is such a huge mess you need to employ a stapler guy to support it. And there’s something called QUOX. Oh my.

And these are all purely for assessment, where earlier there were some purely for survey/data collection. It seems to me that they shouldn’t be so different. Fundamentally isn’t it all just questions?

Another take on this, I suppose, is sites like Stack Overflow, which represent a different sort of questioning. And there is OSQA, “the Open Source Q&A system”, which is cool. You could run that on your server, or for that matter run Moodle, or some survey platform, most likely. So that’s also another delivery model: the run-your-own-server-with-pre-built-software model. A lot of setup/maintenance overhead, and still not a lot of interoperability as far as I can tell. (OSQA is also available hosted.)

Just one more: There are also frameworks for building assessments, which try to generalize while still providing some structure. I was happy to find out about the one linked, for Rails; I don’t know if there are others or if any are widely used.

Markdown is pretty much the best thing ever. (Note to self: get off wordpress…) Can we come up with a markdown solution to the question problem? Something super light-weight, that blends easily into text files that humans would actually write…

The kramdown (etc.) markdown extension for definition lists seems like a candidate. Here’s how it works:

This is the "term".
: This is the "definition".

That gets rendered something like this, using the standard HTML definition list tags:

This is the “term”.
This is the “definition”.

So let’s say the term is the question, and the (possibly many) definitions are answer choices. Of course we could have a blank definition represent a text box (or text area):

What do you think?
:

A multiple-choice survey could be as easy as this then:

What's your favorite color?
: red
: blue
: green

To add correctness functionality, a little more syntax could be added:

Sugar is sweet.
: true*
: false

The idea here is that these text files would be rendered into interactive HTML/Javascript such that you wouldn’t see which was the correct answer – you would select an answer, possibly have a submit button of some kind, and get feedback on whether your answer agreed with the one in the text. I do think that teacherly paranoia about “test security” is one thing that prevents good functionality from spreading much on the web. Nobody wants to share their oh-so-secret correct answers, lest the horrible children cheat. I think this perspective is a disease on society.
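
To make the idea a little more concrete, here is a rough sketch of how such a text file might be read before any rendering happens. Nothing here is an existing tool: parse_questions and check are names invented for this example, and the rules it follows (one-line questions, choices starting with a colon, a trailing asterisk marking a correct choice, a blank line ending a question) are just my reading of the syntax proposed above.

```python
def parse_questions(text):
    """Parse definition-list style questions into a list of dicts."""
    questions = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            current = None  # a blank line ends the current question
        elif stripped.startswith(':'):
            if current is None:
                continue  # a choice with no question above it; skip it
            choice = stripped[1:].strip()
            is_correct = choice.endswith('*')
            if is_correct:
                choice = choice[:-1].rstrip()
            # An empty choice stands for a free-text box.
            current['choices'].append(choice)
            if is_correct:
                current['correct'].append(choice)
        else:
            # Any other line starts a new question.
            current = {'prompt': stripped, 'choices': [], 'correct': []}
            questions.append(current)
    return questions


def check(question, answer):
    """Return True if `answer` agrees with a choice marked correct."""
    return answer in question['correct']


quiz = parse_questions("Sugar is sweet.\n: true*\n: false\n")
print(quiz[0]['choices'])       # ['true', 'false']
print(check(quiz[0], 'true'))   # True
print(check(quiz[0], 'false'))  # False
```

A question with no starred choice simply has an empty list of correct answers, which is one way a survey question and an assessment question could share a format.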

Maybe this could be a short answer question:

What is the capital city of Wisconsin?

Of course you have the problems of evaluating text answers (Is “Madison, WI” also correct? etc.). Generally, there is of course an awful lot of functionality that you want from questions, and it may be hard to reduce it all down. Some things should be obvious: true and false is a special case of multiple choice. But other things like scoring, when/whether to show the correct answer, etc. seem difficult to abstract very far.

The text questions could be rendered as stand-alone HTML/Javascript, or to connect with (or even be hosted on) some sort of web system. More details would have to be worked out.

The illustrious Ramnath, who always seems to be doing cool things several years before I know about them, has thought about this markdown question idea to some degree. I want to find out more about what he’s done.

doge coding: much wow

I have recently come across two more or less doge-titled educational resources for coding. This definitely constitutes a trend.

First up is Learn You a Haskell for Great Good!. I’m pretty sure the title includes the exclamation point. It’s a free book about Haskell, of course. (You can also buy it if you want.)

Last up is Learn You The Node.js For Much Win!. Same deal with the exclamatory title. This one is a command-line interactive tutorial about node.js that runs on workshopper. I found out about this after first hearing about a similar thing for git called git-it.

I, for one, would love to see these somehow form the basis for an entire line of amusingly titled “Learn you” books (and so on).

Use counts, not percentages

Consider this data:

total    part   percent
  765      47        6%

Clearly, there is some redundancy. Both part and percent express the same thing.

With infinite precision, you could use either part or percent at your pleasure. However, in the common case where the counts (total and part) are integers and the percentage(s) are not, computers will store the integers generally much more nicely and compactly than nasty decimal things (floats or string representations).

Percentages also commonly get rounded off, in which case information is lost. In the above example, 6% of 765 could be anything from 43 to 49, and possibly even more depending on what precision is used for the calculation.
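
The 43-to-49 range is easy to check. This is a quick Python 3 sketch that assumes the percentage was rounded to the nearest whole number; it lists every count out of 765 that would be reported as 6%.

```python
total = 765

# Every possible count that rounds to 6% of the total.
candidates = [part for part in range(total + 1)
              if round(100 * part / total) == 6]

print(candidates[0], candidates[-1])  # 43 49
print(len(candidates))                # 7
```

Seven different counts collapse into the same "6%", which is exactly the information the rounded percentage throws away.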

The moral of the story is that for data, you should always use counts, not percentages.

Number-line comparisons are good for code

You have a lot of flexibility in how you write comparisons with most programming languages. Typically, for example, you can use either of x < 5 and 5 > x.

I recommend the convention of writing comparisons as if they come from a number line. That is, always use “<” and “<=”, and never use “>” and “>=”.

This convention reduces cognitive load – it makes code easier to read and write. This is particularly true when testing ranges, and I find it makes an exceptionally large difference in the common case of testing date ranges. Compare the following two lines of pseudo-code:

date > 2010-04-23 & date < 2010-08-11 # bad
2010-04-23 < date & date < 2010-08-11 # good

The second line is much more readable than the first, and even less clear versions than the first are possible. When the operators point in mixed directions it becomes easy to introduce errors, and it’s not even unlikely that you’ll end up testing for an impossible range.

In some languages (notably Python) you can write chained comparisons like 3 < x < 5 and it will work as expected. In some languages you can write chained comparisons like that and it will evaluate but probably not as you intended. In JavaScript, 3 < 4 < 2 is true. (wat) In Ruby and R, chained comparisons like these will give you an error. So I prefer the style already shown, with the variable being tested close to the “and” operator joining the two comparisons.
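
For the Python case, here is a small sketch of the date-range test from above written as a chained comparison. The function name in_range is made up for the example; the dates are the ones from the pseudo-code.

```python
from datetime import date

start = date(2010, 4, 23)
end = date(2010, 8, 11)


def in_range(day):
    """True when day falls strictly between start and end."""
    # Python evaluates this as (start < day) and (day < end).
    return start < day < end


print(in_range(date(2010, 6, 1)))   # True
print(in_range(date(2010, 8, 11)))  # False: the end date is excluded
print(3 < 4 < 2)                    # False in Python, unlike JavaScript
```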

Many languages use “=” for assignment and “==” for testing equality, so it has been noted that 5 == x is safer than x == 5 in the sense that if you mistakenly write x = 5 then you’ve broken something, but 5 = x is just an error that will get caught. I’m not terribly concerned about this. In both Python and R, the equivalent of “if x = 5” is an error.
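
The Python half of that claim can be checked without crashing anything by handing the source text to the built-in compile function. This is only a sketch, and the helper name compiles is made up for the example.

```python
def compiles(source):
    """Return True if Python accepts source as valid syntax."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False


print(compiles('if x == 5:\n    pass\n'))  # True: a comparison is fine
print(compiles('if x = 5:\n    pass\n'))   # False: assignment is rejected
```

Because the mistake is caught before the code ever runs, the “5 == x” habit buys very little in Python.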

Since R uses “<-” for assignment, there is a similar possible problem:

x < -2 # compares x to -2
x <-2  # assigns -2 to x

This is indeed annoying. (Thanks Tommy for pointing it out.) I think the advantages of writing nice number-line comparisons outweigh this risk in R, but it is the most compelling argument I’ve seen for not using “<-” for assignment.