How to calculate Gini Coefficient from raw data in Python

The Gini Coefficient is a measure of inequality. It’s well described on its Wikipedia page, and also with simpler examples here.

I don’t find the implementation in the R package ineq particularly conversational, and also I was working on a Python project, so I wrote this function to calculate a Gini Coefficient from a list of actual values. It’s just a fun little integration-as-summation. Not bad!

def gini(list_of_values):
    """Return the Gini coefficient of a list of nonnegative values."""
    sorted_list = sorted(list_of_values)
    height, area = 0, 0
    for value in sorted_list:
        height += value
        # each value adds a trapezoid under the Lorenz curve:
        # a bar up to the previous height plus a triangle on top
        area += height - value / 2
    # the area under the line of perfect equality
    fair_area = height * len(list_of_values) / 2
    return (fair_area - area) / fair_area

To me this is fairly readable and maps nicely to the mental picture of adding up the area under the Lorenz curve and then comparing it to the area under the line of equality. It’s just bars and triangles! And I don’t think it’s any less performant than the ineq way of calculating it.
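As a quick sanity check, a perfectly equal list should come out to zero, and a list where one value has everything should come out near one (for this formula, the most unequal a list of n values can be is (n - 1)/n, so 0.75 for four values):

>>> gini([1, 1, 1, 1])  # perfect equality
0.0
>>> gini([0, 0, 0, 1])  # one value has everything
0.75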

(update: lalala, I think there are some edge cases where the standard way of calculating Gini and this way are not in agreement; I’ll look into it if I ever think about this again – feel free to figure it out and leave a comment!)

R-Squared and Adjusted R-Squared for a Linear Model

We would like to make a linear model such that the residuals – how far off the model’s predictions are, for the training data – are small. If the residuals are small, the model is doing a good job of predicting. With OLS we’re always minimizing the residuals, but how can we tell whether, in the end, they’re actually small?

We can measure the “bigness” of the residuals by their variance. Note that the residuals will never all be big while having small variance, because if that happened we could shift the model to shrink them further. For example, if the residuals are all one, just add one to the constant term of the model and you reduce all the residuals to zero. (In fact, OLS with an intercept always leaves the residuals with mean zero.)
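Here’s a quick numeric check, as a sketch using numpy (polyfit with degree one does an ordinary least squares line fit; the x and y values here are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

slope, intercept = np.polyfit(x, y, 1)  # OLS line fit with an intercept
residuals = y - (slope * x + intercept)

print(residuals.mean())  # ~0: the constant term has already absorbed any offset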

We want our measure of residual bigness to be the same regardless of whether we’re working in light-years or millimeters, so we need to scale by something with the same units. We scale by the variance of the values we’re predicting (the labels).

So now we have a measure of residual bigness that is equal to zero if all the residuals are zero and the model is “perfect”, and equal to one if the model is constant (this is easy to see if, for example, the model always predicts that the label is zero, so that the residuals are just the labels themselves). This measure is then in some sense the percentage of variance (in the labels) that the model does not explain (it’s still in the residuals).

We would usually prefer to talk about a measure that gets bigger when we do a better job, so we subtract this measure of residual bigness from one to get R^2, the percentage of variance explained. (Consider how you might try to define this differently.)

R^2 = 1 - \frac{var(residuals)}{var(labels)}
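In code, continuing with the made-up fit from above (numpy’s var puts the same 1/n under both variances, so the ratio is the same as the ratio of sums of squares):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

r_squared = 1 - residuals.var() / y.var()
print(r_squared)  # about 0.73 for these made-up numbers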

It turns out that if you add a bunch of random predictors to a model you can get R^2 to go up and up without having it mean anything. Adjusted R-Squared tries to account for this by penalizing the measure when there are more predictors p (not counting the constant term), relative to the number of training examples n.

adjusted R^2 = 1 - \frac{var(residuals)}{var(labels)} \cdot \frac{n-1}{n-p-1}
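Transcribed directly into a little function (again the matching 1/n factors in numpy’s var cancel out of the ratio, so plain var works here too):

import numpy as np

def adjusted_r_squared(residuals, labels, p):
    # p: number of predictors, not counting the constant term
    n = len(labels)
    unexplained = np.var(residuals) / np.var(labels)
    return 1 - unexplained * (n - 1) / (n - p - 1)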

When p gets bigger, that denominator gets smaller, so a larger thing is subtracted, so adjusted R^2 goes down. Adjusted R^2 should help discourage you from adding predictors all willy-nilly. (Could you have a negative adjusted R^2? Could you have an adjusted R^2 greater than one? Play around with these ideas. Try it in R.)
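Here’s a sketch of that experiment in Python rather than R, assuming numpy (np.linalg.lstsq does the OLS fit, and the junk predictors are pure noise). You should see R^2 creep upward as junk columns pile on, while adjusted R^2 barely moves:

import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)  # one real predictor plus noise

for p in (1, 10, 25):
    # design matrix: a constant column, the real predictor, and p - 1 junk columns
    X = np.column_stack([np.ones(n), x, rng.normal(size=(n, p - 1))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    unexplained = residuals.var() / y.var()
    print(p, 1 - unexplained, 1 - unexplained * (n - 1) / (n - p - 1))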

Note one: this R^2 is quite related to the correlation coefficient r: for a simple regression with one predictor, R^2 is exactly r squared.

Note two: this R^2 is a measure of training error. Often you’ll be more concerned with how the model performs on data it wasn’t created with.