Data done wrong: The only-most-recent data model

It’s not uncommon to encounter a database that stores only the most recent state of things. For example, say the database has one row per Danaus plexippus individual. The database could have a column called stage that tells you whether an individual is currently a caterpillar or a butterfly, for instance.

This kind of design might seem fine for some applications, but it gives you no way of seeing what happened in the past. When did that individual become a butterfly? (Conflate, for the moment, the time of the change in the real world and the time the change is made in the database, and say that the change is instantaneous.) Disturbingly often, you find after running a timeless database for some time that you actually do need to know how the database changed over time, but you haven’t got that information.

There are at least two approaches to this problem. One is to store transactional data. In the plexippus example this could mean storing one row per life event per individual, stamped with the date-time at which the row was written to the database. The current known state of each individual can still be extracted (or maintained as a separate table). Another approach is to use a database that tracks all changes; the idea is something like version control for databases, and one implementation with a philosophy like this is Datomic.
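As a rough sketch of the first approach, here is what "one row per life event" could look like in plain Python. The names LifeEvent, recorded_at, and current_state are made up for illustration, and the sample rows are invented; a real system would keep these rows in a database table rather than a list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: a recorded event is never edited in place
class LifeEvent:
    individual_id: int
    stage: str
    recorded_at: datetime  # when the row was written to the database


def current_state(events):
    """Return {individual_id: stage}, taking the latest recorded event per individual."""
    latest = {}
    for event in sorted(events, key=lambda e: e.recorded_at):
        latest[event.individual_id] = event.stage  # later rows overwrite earlier ones
    return latest


# Invented sample data, for illustration only.
events = [
    LifeEvent(1, "caterpillar", datetime(2014, 3, 1, 9, 0)),
    LifeEvent(2, "caterpillar", datetime(2014, 3, 2, 9, 0)),
    LifeEvent(1, "butterfly", datetime(2014, 4, 3, 9, 0)),
]

print(current_state(events))
```

The only-most-recent table from the beginning of the post is exactly what current_state returns, so nothing is lost by keeping the event rows instead.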

With a record of transactional data or a database that stores all transactions, you can query back in time: what was the state of the database at such-and-such time in the past? This is much better than our original setup. We don’t forget what happened in the past, and we can reproduce our work later even if the data is added to or changed. Of course this requires that the historical records not be themselves modified – the transaction logs must be immutable.
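Querying back in time over such a log can be sketched as a filter on the recording time. This is a minimal illustration, assuming each row is a tuple of (individual_id, stage, recorded_at); the function name state_as_of and the sample rows are hypothetical.

```python
from datetime import datetime

# Invented sample log, stored as a tuple so it cannot be modified in place.
# Each row: (individual_id, stage, recorded_at)
LOG = (
    (1, "caterpillar", datetime(2014, 3, 1)),
    (2, "caterpillar", datetime(2014, 3, 10)),
    (1, "butterfly", datetime(2014, 4, 3)),
)


def state_as_of(log, when):
    """Return {individual_id: stage} as the database would have shown it at `when`."""
    state = {}
    for individual_id, stage, recorded_at in sorted(log, key=lambda row: row[2]):
        if recorded_at <= when:  # ignore anything recorded after the asked-for moment
            state[individual_id] = stage
    return state


print(state_as_of(LOG, datetime(2014, 3, 5)))
print(state_as_of(LOG, datetime(2014, 4, 10)))
```

Because rows are only ever appended, asking the same as-of question later gives the same answer, which is what makes earlier work reproducible.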

This is where simple transactional designs on traditional databases fail. If someone erroneously enters on April 4th that an individual became a butterfly on April 3rd, when really the transformation occurred on April 2nd, and this mistake is only realized on April 5th, there has to be a way of adding another transaction to indicate the update – not altering the record entered on April 4th. This can quickly become confusing – it can be a little mind-bending to think about data about dates which changes over time. The update problem is a real headache. I would like to find a good solution to this.
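One pattern that at least keeps the April example straight is to store two dates on every row: when the event is believed to have happened, and when that belief was recorded. The sketch below is only an illustration of that idea, not a claim that it solves the update problem in general. The year 2014, the tuple layout, and the name believed_transition are all assumptions, as is the rule that the most recently recorded row wins.

```python
from datetime import date

# Each row: (individual_id, stage, occurred_on, recorded_on)
log = [
    # Entered on April 4th: "became a butterfly on April 3rd" (the mistaken entry).
    (1, "butterfly", date(2014, 4, 3), date(2014, 4, 4)),
]

# The mistake is noticed on April 5th. The old row is left untouched;
# a correcting row is appended instead.
log.append((1, "butterfly", date(2014, 4, 2), date(2014, 4, 5)))


def believed_transition(log, individual_id, stage, known_on):
    """Return the date we believed, as of `known_on`, that the stage was reached.

    Assumption: among rows recorded on or before `known_on`, the most
    recently recorded one wins. Returns None if nothing was recorded yet.
    """
    candidates = [
        row for row in log
        if row[0] == individual_id and row[1] == stage and row[3] <= known_on
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[3])[2]


print(believed_transition(log, 1, "butterfly", date(2014, 4, 4)))
print(believed_transition(log, 1, "butterfly", date(2014, 4, 5)))
```

Asking as of April 4th returns the mistaken April 3rd date, and asking as of April 5th returns the corrected April 2nd date, so both the error and its fix remain visible in the log.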

2 thoughts on “Data done wrong: The only-most-recent data model”

aaronschumacher says:

April 5, 2014 at 11:06 pm

Harlan notes the relation to these ideas: http://en.wikipedia.org/wiki/Slowly_changing_dimension. I don’t think any of those really address what I’m trying to get at here; they’re both too technology-specific (to SQL) and not sufficiently domain-general (to a full range of applications).

aaronschumacher says:

April 5, 2014 at 11:09 pm

Another possible complication: The chain of reference for data sourcing can be longer. For example, something happens and then somebody tells you about it and you put it in the database. Then later they tell you something different about it. How does the database reflect all of this? Should it just collapse the chain of provenance?

Plan Space from Outer Nine

education, data, and the internet
