Spotfire (which owns a web data analysis package and now also S-Plus) recently posted on petabyte databases, and I started wondering whether petabyte databases would ever come to clinical research. The examples they provided (Google, the Large Hadron Collider, and World of Warcraft/Avatar) are nearly self-contained data production and analysis systems, in the sense that almost the entire data collection and storage process is automated. This property allows the production of a large amount of high-quality data, and our technology has reached a point where petabyte databases are possible.
By contrast, clinical research involves many inherently manual processes. Even with electronic data capture, which has generally improved the collection of clinical data, the process retains enough manual steps to keep databases fairly small. Right now, individual studies have clinical databases on the order of tens of megabytes, with the occasional gigabyte-scale database when a lot of laboratory data is collected (a process that does have more automation, at least on the data accumulation and storage end). Large companies with many products might have tens of terabytes of storage, but any single analysis covers at most a few gigabytes. At the FDA, the scaling is more extreme, since they must analyze data from many different companies on many different products. I don't know how much storage they have, but I can imagine it runs to petabytes; even so, at the single-product scale, individual analyses still focus on a few gigabytes at a time.
I don't think we will see petabyte databases in clinical research until electronic medical records become the primary data collection source. Before that happens, the systems now in place will have to become standardized, interoperable, and simply of higher quality than they are today. By then, we will be able to follow the trails that Google and the LHC have blazed.