Wednesday, July 28, 2010

Using ODBC and R to analyze Lotus Notes databases (including email)

For several reasons, I want to analyze data that comes out of Lotus Notes. One such kind of data, of course, is email. So here's how I did it. This approach requires MS Windows, though it may be possible on Linux and Mac as well, since IBM supports those platforms. I'm sure other solutions exist for other email platforms, but I won't go into that here.
  1. Download NotesSQL, which is an ODBC (Open Database Connectivity) driver for Lotus Notes. In a nutshell, ODBC allows most kinds of databases, such as Oracle, MySQL, or even Microsoft Access and Excel, to be connected to software, such as R or SAS, that can analyze the data they contain.
  2. The setup for ODBC on Windows is a little tricky, but worth it. Install NotesSQL, then add the following directories to your PATH (instructions here):
    1. c:\Program Files\lotus\notes
    2. c:\NotesSQL
  3. Follow the instructions here to set up the ODBC connection. There is also a set of instructions here. Essentially, you will run an application installed by NotesSQL to set up the permissions to access the Lotus databases, and then use Microsoft's ODBC tool to set up a Data Source Name (DSN) pointing to your Lotus mail file. Usually, your mail file will be named something like userid.nsf. In what follows, I have assumed that the DSN is "lotus", but you can use any name you like in the control panel.
  4. Start up R, and install/load the RODBC package. Set up a connection to the Lotus database.
  5. library(RODBC)
     ch <- odbcConnect("lotus")
  6. You may have to use sqlTables to find the right name of the table or view (there's a short sketch of this right after the list), but I found the database view _Mail_Threads_, so I used that. Consult the RODBC documentation for how to use the commands.
  7. foo <- sqlFetch(ch,"_Mail_Threads_")
  8. Here's where the real fun begins. foo is now a data frame with the sender, the date/time, and the subject line of your emails (INBOX and filed). So have some fun.
  9. # find out how many times each sender has ever emailed you, and plot it
     bar <- table(foo[,1])
     # sort in descending order of count
     bar <- bar[rev(order(bar))]
     barplot(bar, names.arg = "")
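As mentioned in step 6, you may have to hunt for the right table or view name first. Here is a minimal standalone sketch of how I would poke around, assuming the "lotus" DSN from step 3; odbcDataSources() and sqlTables() are standard RODBC functions for listing the available DSNs and the tables/views a connection exposes.

library(RODBC)

# confirm that Windows knows about the "lotus" data source
odbcDataSources()

# open the connection and list the tables/views that NotesSQL exposes
ch <- odbcConnect("lotus")
tabs <- sqlTables(ch)
head(tabs)

# narrow it down to mail-related views; the one I used was "_Mail_Threads_"
grep("Mail", tabs$TABLE_NAME, value = TRUE)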
Say, does that barplot of sender counts look like a power law distribution?
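A rough way to eyeball that question is to plot count against rank on a log-log scale and see whether the points fall near a straight line (which is consistent with, though certainly not proof of, a power law). This sketch assumes bar is the sorted table of sender counts from step 9.

# log-log plot of emails per sender versus sender rank
counts <- as.numeric(bar)
ranks <- seq_along(counts)
plot(log10(ranks), log10(counts),
     xlab = "log10(rank of sender)",
     ylab = "log10(number of emails)")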
Oh, don't forget to clean up after yourself.
odbcClose(ch)

More on the petabyte milestone

One area that I think can break the petabyte milestone soon, if not today, is genomics research. Again, you have a relatively automated system as far as data collection and storage are concerned.


Tuesday, July 27, 2010

In which I speculate about breaking through the petabyte milestone in clinical research

Spotfire (which owns a web data analysis package and now also S-Plus) recently posted on petabyte databases, and I started wondering whether petabyte databases would come to clinical research. The examples they provided (Google, the Large Hadron Collider, and World of Warcraft/Avatar) are nearly self-contained data production and analysis systems, in the sense that nearly the entire data collection and storage process is automated. This property allows the production of a large amount of high-quality data, and our technology has gotten to the point where petabyte databases are possible.

By contrast, clinical research has a lot of inherently manual processes. Even with electronic data capture, which has generally improved the collection of clinical data, the process still has enough manual parts to keep the databases fairly small. Right now, individual studies have clinical databases on the order of tens of megabytes, with the occasional gigabyte-scale database if a lot of laboratory data is collected (which is a bit more automated, at least on the data accumulation and storage end). Large companies with a lot of products might have tens of terabytes of storage, but data analysis only occurs on a few gigabytes at a time at most. At the FDA, this kind of scaling is more extreme, since they have to analyze data from a lot of different companies on a lot of different products. I don't know how much storage they have, but I can imagine they would need petabytes; still, at the single-product scale the individual analyses focus on a few gigabytes at a time.

I don't think we will hit petabyte databases in clinical research until electronic medical records are the primary data collection source. And before that happens, I think the systems that are in place will have to be standardized, interoperable, and simply of higher quality than they are now. By then, we will be able to look to the trails that Google and the LHC have blazed.

 


Friday, July 23, 2010

Information allergy

Frank Harrell gave a talk at this year's useR! conference on "information allergy." (I did not attend the conference, but it looks like I should have.) Information allergy, according to the abstract, is defined as a two-part problem reflecting a willful refusal to do what it takes to make good decisions:
  1. refusing to obtain key information to make a sound decision
  2. ignoring important available information
One of the major areas pointed out is the refusal to acknowledge "gray areas," which forces one into a false binary choice. I have observed this on many occasions, which is why I usually recommend analyzing a continuous endpoint in conjunction with a binary one.
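To illustrate the cost of forcing the binary choice, here is a small simulation sketch. The sample size, effect size, and cutoff are arbitrary numbers chosen just for the example: a two-sample t-test on the continuous endpoint is compared with a test of proportions after dichotomizing the same data at a threshold.

# simulated power: continuous endpoint vs. the same endpoint dichotomized
# (n, delta, and cutoff below are arbitrary values for illustration)
set.seed(42)
nsim <- 2000
n <- 50        # subjects per group
delta <- 0.5   # true mean difference, in SD units
cutoff <- 0.5  # threshold used to define a binary "responder"

p.cont <- p.bin <- numeric(nsim)
for (i in seq_len(nsim)) {
  x <- rnorm(n, mean = 0)
  y <- rnorm(n, mean = delta)
  p.cont[i] <- t.test(x, y)$p.value
  p.bin[i] <- prop.test(c(sum(x > cutoff), sum(y > cutoff)), c(n, n))$p.value
}

mean(p.cont < 0.05)  # power using the continuous endpoint
mean(p.bin < 0.05)   # power after dichotomizing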
At any rate, I look forward to reading the rest of the talk.
Update: the slides from a previous incarnation of the talk can be found here.