Wednesday, July 28, 2010

Using ODBC and R to analyze Lotus Notes databases (including email)

For several reasons, I want to analyze data that comes out of Lotus Notes. One kind of such data, of course, is email. So here's how I did it. This requires MS Windows, but it may be possible to do this on Linux and Mac, because IBM supports those platforms as well. Also, I'm sure that other solutions exist for other email platforms, but I won't go into that here.
  1. Download NotesSQL, which is an ODBC (open database connectivity) driver for Lotus Notes. In a nutshell, ODBC allows most kind of databases, such as Oracle, MySQL, or even Microsoft Access and Excel to be connected with software, such as R or SAS, that can analyze data from those databases.
  2. The setup for ODBC on Windows is a little tricky, but worth it. Install NotesSQL, then add the following to your PATH (instructions here):
    1. c:\Program Files\lotus\notes
    2. c:\NotesSQL
  3. Follow the instructions here to set up the ODBC connection. There is also a set of instructions here. Essentially, you will run an application installed by the NotesSQL to set up the permissions to access the Lotus databases, and then use Microsoft's ODBC tool to set up a Data Source Name (DSN) to your Lotus mail file. Usually, your mail file will be named something like userid.nsf. In what follows, I have assumed that the DSN is "lotus" but you can use any name in the control panel.
  4. Start up R, and install/load the RODBC package. Set up a connection to the Lotus database.
  5. library(RODBC) ch <- odbcConnect("lotus")
  6. You may have to use sqlTables to find the right name of the table, but I found the database view _Mail_Threads_, so I used that. Consult the RODBC documentation for how to use the commands.
  7. foo <- sqlFetch(ch,"_Mail_Threads_")
  8. Here's where the real fun begins. foo is now a data frame with the sender, the date/time, and the subject line of your emails (INBOX and filed). So have some fun.
  9. # find out how many times somebody has ever sent you email, and plot it bar <- table(foo[,1]) # sort in reverse descending order bar <- bar[rev(order(bar))] barplot(bar,names.arg="") 
Say, is that a power law distribution?
Oh, don't forget to cleanup after yourself.