Realizations in Biostatistics: Dirty data processing in SAS and R

Friday, December 31, 2010

Dirty data processing in SAS and R

For many data management needs, I have typically relied on SAS. The program data vector is perfectly suited to dealing with datasets on a record-by-record basis (which was why it was invented!), and sorting, subsetting, and merging datasets are relatively easy tasks. (One of the questions I will ask SAS programmer candidates involves showing them a bit of code using a merge without a by statement, which silently pairs records by position rather than by key. The trick is to see if they catch on.)

In a clean data environment, such as premarketing clinical trials, these operations are usually adequate, though even then it's sometimes hard to identify key variables in a dataset when the database is not properly set up. However, I'm now moving into an environment where data from several sources has to be merged. SAS has some tools to handle this, such as fuzzy matching, hash tables, and regular expressions (supporting Perl regular expressions as of version 9). I find them rather cumbersome to use, though. Consider the following code (from a SUGI paper by Ron Cody):

DATA _NULL_;
   TITLE "Perl Regular Expression Tutorial – Program 1";
   IF _N_ = 1 THEN PATTERN_NUM = PRXPARSE("/cat/");
   *Exact match for the letters 'cat' anywhere in the string;
   RETAIN PATTERN_NUM;
   INPUT STRING $30.;
   POSITION = PRXMATCH(PATTERN_NUM,STRING);
   FILE PRINT;
   PUT PATTERN_NUM= STRING= POSITION=;
DATALINES;
There is a cat in this line.
Does not match CAT
cat in the beginning
At the end, a cat
cat
;

All of this is just to match using a Perl regular expression. The regular expression must be compiled (using the cumbersome IF _N_=1 device) and then the compiled regular expression is referred to in the body of the data step.

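I think that SAS also supports a sounds-like comparison (the =* operator in PROC SQL, based on Soundex), which determines if two words sound reasonably alike; the SQL LIKE operator itself only does wildcard pattern matching. Levenshtein distance is available in Base SAS through the COMPLEV function (with SPEDIS and COMPGED as related spelling-distance measures), so it does not require the text analytics package. Because I haven't experimented with the text analytics package, I cannot say whether it is worth the money.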
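For comparison on the R side, current versions of base R provide a generalized Levenshtein distance through adist() in the utils package. The following is a minimal sketch; the names are made up for illustration and do not come from any real dataset.

```r
# Hypothetical names, invented for illustration only.
a <- c("Jon Smith", "Katherine Jones")
b <- c("John Smith", "Catherine Jones", "Bob Lee")

# adist() returns a matrix of edit distances:
# rows index the elements of `a`, columns index the elements of `b`.
d <- adist(a, b)
dimnames(d) <- list(a, b)
print(d)

# For each name in `a`, pick the candidate in `b` with the smallest distance.
closest <- b[apply(d, 1, which.min)]
print(closest)
```

Picking the minimum always returns some candidate, even a poor one, so in practice a distance cutoff would be needed as well.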

The brute force way to use these tools is to create a large dataset that holds the Cartesian product (cross join) of the two datasets to merge, so that every record from one source is paired with every record from the other. The regular expressions or whatever criteria you like can then be used to filter out the pairs that do not match. Other methods are possible, but I'm only getting started down this road, so I'll have to share more clever methods later.
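The brute force approach can be sketched in base R as follows. The two tables, their column names, and the distance threshold of 4 are all assumptions chosen for illustration; a real threshold would have to be tuned against the data.

```r
# Hypothetical source tables, invented for illustration.
src1 <- data.frame(id1 = 1:3,
                   name1 = c("Acme Corp", "Globex", "Initech"),
                   stringsAsFactors = FALSE)
src2 <- data.frame(id2 = 101:103,
                   name2 = c("ACME Corp.", "Initech LLC", "Umbrella"),
                   stringsAsFactors = FALSE)

# merge() with by = NULL returns the Cartesian product: 3 x 3 = 9 candidate pairs.
pairs <- merge(src1, src2, by = NULL)

# Score each candidate pair with a case-insensitive edit distance.
pairs$dist <- mapply(function(x, y) adist(x, y, ignore.case = TRUE)[1, 1],
                     pairs$name1, pairs$name2, USE.NAMES = FALSE)

# Keep only the pairs within the (arbitrary) threshold.
matches <- pairs[pairs$dist <= 4, ]
print(matches)
```

The candidate table grows as the product of the two row counts, which is why this only works for small sources or after blocking on some coarser key first.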

In fuzzy matching cases, R is not so bad to use despite the lack of the implicit looping that SAS has. Mostly, it's because matching and otherwise using regular expressions is much simpler than the implementation in SAS. (SAS BASE DEVELOPERS! DO YOU HEAR ME!) Beyond the simpler syntax, however, that's not really an improvement over the Base SAS implementation. There is the book Data Mashups in R, which shows some very interesting ideas for data cleaning and merging using web services (specifically, geocoding services), but its one manual merge had a very clean key.
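To make the comparison concrete, here is a rough R counterpart to the SAS program above, using the same five test strings. It is a sketch rather than an exact port: regexpr() reports a non-match as -1, whereas PRXMATCH returns 0.

```r
# The same five test strings as in the SAS example.
strings <- c("There is a cat in this line.",
             "Does not match CAT",
             "cat in the beginning",
             "At the end, a cat",
             "cat")

# regexpr() returns the 1-based position of the first match, or -1 if none.
# There is no separate compile step and no explicit loop over records.
position <- as.integer(regexpr("cat", strings, perl = TRUE))
print(data.frame(string = strings, position = position))

# grepl() gives a plain TRUE/FALSE flag, handy for subsetting.
has_cat <- grepl("cat", strings, perl = TRUE)
print(strings[has_cat])
```

Both functions are vectorized over the strings, which stands in for the implicit record loop of the data step.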

Over the last few days, I've found a couple of really cool tools that seem to expand the possibilities of data cleaning and merging from different sources (so-called "mashups"). One is the Google Refine tool and the other is the RecordLinkage package in R. Both hold a lot of promise, and I'll be trying both out in the near future.