Friday, December 31, 2010

Dirty data processing in SAS and R

For many data management needs, I have typically relied on SAS. The data step, with its program data vector, is perfectly suited to processing datasets record by record (which is why it was invented!), and sorting, subsetting, and merging datasets are relatively easy tasks. (One of the questions I ask SAS programmer candidates involves showing them a bit of code using a merge without a BY statement. The trick is to see if they catch on.)
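For the curious, here is a minimal sketch of that gotcha, with made-up datasets. Without a BY statement, MERGE combines the datasets by position rather than by key:

DATA ONE;
  INPUT ID $ X;
DATALINES;
A 1
B 2
C 3
;

DATA TWO;
  INPUT ID $ Y;
DATALINES;
C 30
A 10
;

* No BY statement: rows pair up by position, so X=1 lands next to Y=30,
  and the ID from TWO silently overwrites the ID from ONE;
DATA WRONG;
  MERGE ONE TWO;
RUN;

* The intended match-merge: sort both datasets and merge BY the key;
PROC SORT DATA=ONE; BY ID; RUN;
PROC SORT DATA=TWO; BY ID; RUN;

DATA RIGHT;
  MERGE ONE TWO;
  BY ID;
RUN;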
 
In a clean data environment, such as premarketing clinical trials, these operations are usually adequate, though even then it's sometimes hard to identify key variables in a dataset when the database is not properly set up. However, I'm now moving into an environment where data from several sources has to be merged. SAS has some tools to handle this, such as fuzzy matching, hash tables, and regular expressions (it has supported Perl regular expressions since version 9). I find them rather cumbersome to use, though. Consider the following code (from a SUGI paper by Ron Cody):
 
DATA _NULL_;
 TITLE "Perl Regular Expression Tutorial – Program 1";
 IF _N_ = 1 THEN PATTERN_NUM = PRXPARSE("/cat/");
 *Exact match for the letters 'cat' anywhere in the string;
 RETAIN PATTERN_NUM;
 INPUT STRING $30.;
 POSITION = PRXMATCH(PATTERN_NUM,STRING);
 FILE PRINT;
 PUT PATTERN_NUM= STRING= POSITION=;
DATALINES;
There is a cat in this line.
Does not match CAT
cat in the beginning
At the end, a cat
cat
;

All of this just to match a string against a Perl regular expression. The regular expression must be compiled (using the cumbersome IF _N_ = 1 device), and the compiled regular expression then referenced in the body of the data step.
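If I remember the version history correctly, SAS 9.2 softens this a bit: a constant pattern can be passed directly to PRXMATCH, which compiles it only once automatically, so the PRXPARSE/RETAIN scaffolding can be dropped. A minimal sketch:

DATA _NULL_;
  INPUT STRING $30.;
  * A constant pattern is compiled only once, with no need for
    PRXPARSE, the IF _N_ = 1 device, or RETAIN;
  POSITION = PRXMATCH("/cat/", STRING);
  PUT STRING= POSITION=;
DATALINES;
There is a cat in this line.
Does not match CAT
;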
 
SAS also supports sounds-like matching: the SOUNDEX function (and the =* sounds-like operator in WHERE clauses) determines whether two words sound reasonably alike. (The SQL LIKE operator, by contrast, only does wildcard pattern matching.) And it turns out Base SAS does support Levenshtein distance via the COMPLEV function, along with SPEDIS and the generalized edit distance COMPGED, so you don't need the text analytics package for the basics. Because I haven't experimented with the text analytics package, I cannot say whether it is worth the money.
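Here is a quick sketch of those functions side by side (the name pairs are made up):

DATA FUZZY;
  LENGTH A B $20;
  INPUT A $ B $;
  SOUND_A = SOUNDEX(A);    * phonetic code;
  SOUND_B = SOUNDEX(B);
  SPELLED = SPEDIS(A, B);  * asymmetric spelling distance, 0 = exact match;
  LEVEN   = COMPLEV(A, B); * Levenshtein edit distance;
  GENED   = COMPGED(A, B); * generalized edit distance with weighted costs;
DATALINES;
Smith Smyth
Johnson Jonson
;

PROC PRINT DATA=FUZZY; RUN;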
 
The brute-force way to use these tools is to create a large dataset that is the Cartesian product of the two datasets to be merged, and then use regular expressions or whatever criteria you like to filter out non-matching records; a sketch follows.
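For example (the dataset and variable names SOURCE_A, SOURCE_B, ID, and NAME are hypothetical), a PROC SQL cross join with an edit-distance filter:

PROC SQL;
  /* Cartesian product of the two sources, keeping only pairs of
     names within a small edit distance */
  CREATE TABLE CANDIDATES AS
  SELECT A.ID AS ID_A, B.ID AS ID_B,
         A.NAME AS NAME_A, B.NAME AS NAME_B
  FROM SOURCE_A AS A, SOURCE_B AS B
  WHERE COMPLEV(A.NAME, B.NAME) <= 2;
QUIT;

Be warned that the intermediate product grows as the product of the two dataset sizes. Other methods are possible, but I'm only getting started down this road, so I'll have to share more clever methods later.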
 
In fuzzy matching cases, R is not so bad to use despite its lack of the implicit looping that SAS has. Mostly, that's because matching and otherwise using regular expressions is much simpler than the implementation in SAS; a single call like grepl("cat", strings) flags every element containing a match. (SAS BASE DEVELOPERS! DO YOU HEAR ME!) That is a matter of convenience, though, not capability beyond the Base SAS implementation. There is also the book Data Mashups in R, which shows some very interesting ideas for data cleaning and merging using web services (specifically, geocoding services), but its one manual merge had a very clean key.
 
Over the last few days, I've found a couple of really cool tools that seem to expand the possibilities of cleaning and merging data from different sources (so-called "mashups"). One is the Google Refine tool, and the other is the RecordLinkage package in R. Both hold a lot of promise, and I'll be trying both out in the near future.

Friday, December 24, 2010

Miscellaneous

A few items:


  • I'm still settling into the new job. I'll be back to writing substantive posts probably early in the new year. The new company doesn't have a strong social media presence, so I will probably advocate for a stronger strategy.
  • Google has recognized the growing number of mobile devices accessing Blogger blogs and has developed a mobile view. All we bloggers have to do is flip a switch, and I have flipped it, so if you are on your Android, iPhone, or other phone, enjoy. (I think Wordpress blogs have been doing this for a while, so it's about time. I had looked into services such as Mobify, but that was getting more complicated than I wanted.)
  • I'm off to play Santa Claus, so enjoy whatever holiday (or just time off) that you celebrate.

Monday, December 6, 2010

Open learning page on statistics

I found the open learning page on statistics via Gelman's blog, and it looks interesting.