(You can find this book online for free – is that legit?)
I like this book quite a lot. It’s a collection of chapters by different authors, and it reads like a series of excellent blog posts. With the exception of chapter 18, it’s quite good. It covers many of the issues that arise in practice when gathering and starting to work with data. The explanation of text encoding in chapter 4 could be the best I’ve seen, and chapter 14 (“myths of cloud computing”) is something I wish a lot of people who present themselves as “cloud experts” would read and understand. Philipp K. Janert, author of Data Analysis with Open Source Tools, contributes a very nice chapter as well.
The book closes with a “framework” for data quality, with these “four Cs”:
It’s not bad, this book. I’d recommend it to anyone who needs to work with data in the real world. I think there’s room for even more theory and practice of data cleaning; I’d like to see an even better book yet!