I'll give a recent example. I was asked to calculate the number of days a person was supposed to take a drug. We had the start date and the end date, so it should have been as easy as end - start + 1. To complicate matters, however, we were also asked to account for days when the investigator told the subject to lay off the drug. That data was collected as free text, so it could show up in any of the following forms:
- 3 days beginning 9/22/2009
- 9/22/2009-9/24/2009
- 9/22-24/2009
- Sept 22, 2009 - Sept 24, 2009
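To make the cost concrete, here is a minimal sketch in Python of the kind of parser this free text forces on the analyst. This is my own illustration, not the code we actually used: the function name `days_off_drug` is hypothetical, and it handles only the four formats shown above, when real entries would need many more branches and manual review.

```python
import re
from datetime import date, datetime

def days_off_drug(text: str) -> int:
    """Turn one free-text 'days off drug' entry into an inclusive day count."""
    text = text.strip()

    # "3 days beginning 9/22/2009" -- the duration is stated outright.
    m = re.match(r"(\d+)\s+days?\s+beginning\b", text, re.IGNORECASE)
    if m:
        return int(m.group(1))

    # "9/22/2009-9/24/2009" -- two full m/d/yyyy dates.
    m = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})\s*-\s*(\d{1,2})/(\d{1,2})/(\d{4})$", text)
    if m:
        start = date(int(m.group(3)), int(m.group(1)), int(m.group(2)))
        end = date(int(m.group(6)), int(m.group(4)), int(m.group(5)))
        return (end - start).days + 1

    # "9/22-24/2009" -- one month and year, a range of days.
    m = re.match(r"(\d{1,2})/(\d{1,2})-(\d{1,2})/(\d{4})$", text)
    if m:
        month, year = int(m.group(1)), int(m.group(4))
        start = date(year, month, int(m.group(2)))
        end = date(year, month, int(m.group(3)))
        return (end - start).days + 1

    # "Sept 22, 2009 - Sept 24, 2009" -- month names, including
    # nonstandard abbreviations like "Sept".
    m = re.match(r"(\w+)\s+(\d{1,2}),\s*(\d{4})\s*-\s*(\w+)\s+(\d{1,2}),\s*(\d{4})$", text)
    if m:
        def parse(name, day, year):
            # strptime's %b accepts "Sep" but not "Sept", so truncate.
            return datetime.strptime(f"{name[:3]} {day} {year}", "%b %d %Y").date()
        start = parse(m.group(1), m.group(2), m.group(3))
        end = parse(m.group(4), m.group(5), m.group(6))
        return (end - start).days + 1

    raise ValueError(f"unrecognized format: {text!r}")

# All four entries above describe the same three days off the drug.
for entry in ["3 days beginning 9/22/2009",
              "9/22/2009-9/24/2009",
              "9/22-24/2009",
              "Sept 22, 2009 - Sept 24, 2009"]:
    print(entry, "->", days_off_drug(entry))  # each prints 3
```

Every new free-text variant means another branch, another regular expression, and another chance to silently miscount a subject's exposure.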
But we should not have had to deal with any of this. One person reviewing the data collection plan with the knowledge that this data would eventually be analyzed would have immediately and strongly recommended collecting it in a structured format, so the statistician could work with it directly at the end of the trial.
I note with great interest that this problem is much wider. This blog post suggests a possible reason: the hard problems of the past had to do with hidden or missing information, while modern problems have to do with insights hidden within data that is in plain sight (a hypothesis of Malcolm Gladwell and probably many others). That is, in the past, simply having the data was good enough. We did not have the space to store huge amounts of data, and certainly not the processing power to sift through all of it. Now we have the storage and the processing power, but our paradigm of thinking about data has not kept up. We still think that all we need is to have the data, when what we really need is to analyze it, discard what is irrelevant, and correctly interpret what remains.
And that's why Hal Varian regards statistics as the "sexy job" of the next decade.