Sunday, June 26, 2011

Watch this space: Data Without Borders

Data scientist Jake Porway is launching a new service called Data Without Borders, which seeks to match statisticians with nonprofits and NGOs.

If you are a data scientist / statistician who wants to do some good, check it out! I think he is also looking for a web designer, and I imagine there will need to be some coordination and management beyond Jake’s initial inspiration.

Sunday, March 27, 2011

More on Ultraedit to R

I think I have found a superior solution to integrating Ultraedit and R. See my post over at IDM Users forum.

Sunday, February 27, 2011

Who tweets about pharma?

I have been playing around with the Python language for a month or two, and really enjoy it. Specifically, I have been using Matthew Russell’s book Mining the Social Web (no financial relation here, just a happy customer) to find out how to do my own custom social network analysis. I’ve really only dipped my toes in it, but even the simple stuff can yield great insights. (Perhaps that’s the greatest insight I’m getting from this book!)

At any rate, I basically compiled all the material from Chapter 1 into one program, and searched for the term #pharma. I came up with the following:

Search term: #pharma
Total words: 7685
Total unique words: 3142
Lexical diversity: 0.408848
Avg words per tweet: 15.370000
The 50 most frequent words are
[u'#pharma', u'to', u'-', u'RT', u'#Pharma', u'in', u'for', u'the', u'and',
u'of', u':', u'a', u'on', u'&', u'#Job', u'is', u'#jobs', u'Pharma',
u'#health', u'#healthinnovations', u'\u2013', u'The', u'at', u'#hcmktg',
u'Pfizer', u'from', u'US', u'with', u'#biotech', u'#nhs', u'de', u'#drug',
u'drug', u'http://bit.ly/PHARMA', u'this', u'#advertising', u'...', u'FDA',
u'#Roche', u'#healthjobs', u'an', u'Medical', u'that', u'AdPharm', u'them',
u'you', u'#medicament', u'Big', u'ouch', u'post']
The 50 least frequent words are
[u'verteidigen', u'video', u'videos,', u'vids', u'viral', u'visiting', u'vom',
u'waiting', u'wave', u'way:', u'weakest', u'website.', u'week', u'weekend,',
u'weekly', u'werden', u'whether', u'wholesaler', u'willingness', u'winning...',
u'wish', u'work?', u'workin...', u'working', u'works', u'works:', u'world',
u'worldwide-', u'write', u'wundern', u'www.adecco.de/offic...', u'year',
u'years,', u'yet?', u"you'll", u'you?', u'youtube', u'you\u2019re', u'zone',
u'zu', u'zur', u'{patients}!?', u'\x96', u'\xa360-80K', u'\xbb', u'\xc0',
u'\xe9cart\xe9s', u'\xe9t\xe9', u'\xe9t\xe9...', u'\xfcber']
Number of users involved in RT relationships for #pharma: 158
Number of RT relationships: 108
Average number of RTs per user: 1.462963
Number of subgraphs: 52

(The term u'<stuff>' above simply means that the words in quotes are stored as Unicode strings, which you can safely ignore here.)
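
For the curious, lexical diversity here is the number of unique words divided by the total number of words (that is how Russell's book defines it; take that definition as an assumption about this output). A quick check of the figures above in R:

3142 / 7685   # 0.408848, the reported lexical diversity
7685 / 15.37  # about 500, the implied number of tweets retrieved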

As for graphing the retweet relationships, I came up with the following:

[Figure graph-pharma-2011-02-27: retweet relationships among #pharma tweeters]

So whom should you follow if you want pharma news? You can probably start with the tweeters at the centers of wheels with a lot of spokes. What are #pharma tweeters concerned about these days? Jobs, jobs, jobs, and maybe some innovations.

Update: I checked the picture and saw that it was downsized into illegibility, so I uploaded an SVG version.

Thursday, January 27, 2011

Coping with lower-truncated assay values in SAS and R

Many assays, for example ones designed to detect a certain antibody, cannot detect values below a certain level (called the "lower limit of quantification," or LLQ). Summarizing such data is tricky: if you simply drop the below-LLQ values, you estimate the mean too high and the standard deviation too low, but if you substitute 0 for them, you underestimate the mean and overestimate the standard deviation.
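
A quick simulation makes the bias concrete (a sketch in R; the true mean, SD, and LLQ here are made up for illustration):

set.seed(1)
x   <- rnorm(100000, mean = 2, sd = 1)  # true mean 2, true SD 1
llq <- 1
mean(x[x >= llq])            # dropping below-LLQ values: mean biased high
sd(x[x >= llq])              # ... and SD biased low
mean(ifelse(x < llq, 0, x))  # substituting 0: mean biased low
sd(ifelse(x < llq, 0, x))    # ... and SD biased high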

One solution is maximum likelihood. It is fairly straightforward to write down the likelihood function:

$$L(x \mid \mu, \sigma, \mathrm{LLQ}) \;=\; \prod_{i=1}^{n} \varphi(x_i; \mu, \sigma) \cdot \Phi(\mathrm{LLQ}; \mu, \sigma)^{\#\mathrm{LLQ}}$$

where φ() is the normal probability density function, Φ() is the normal cumulative distribution function, the product runs over the n data values actually observed, and #LLQ is the number of values below the LLQ. What this likelihood essentially says is: if a data point was actually observed, use it (it contributes the value of the density function at that data value), but if it is below the LLQ, substitute the probability of falling below the LLQ into the likelihood.
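
Taking logs turns the product into a sum, and this log version is what both the SAS and R code below actually maximize:

$$\ell(\mu, \sigma) \;=\; \sum_{i=1}^{n} \log \varphi(x_i; \mu, \sigma) \;+\; \#\mathrm{LLQ} \cdot \log \Phi(\mathrm{LLQ}; \mu, \sigma)$$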

I looked around for a way to do this in SAS, and I found three. The first is to use SAS/IML with one of its optimization routines (such as nlpdd()). The second is to use SAS/OR with PROC NLP. The third way, and my preferred way at the moment, is to use SAS/STAT with PROC NLMIXED (SAS/STAT is found at nearly every SAS installation, whereas IML and OR are not always).

To do this in SAS:

%let llq=1; * this is the lower detectable level ;

* read in the data in a useful way - I tried using the special missing value .L for LLQ, but NLMIXED below did not like it, so LLQ values become 0 here. Swap in another sentinel value for 0 if that clashes with your real data ;

proc format;
  invalue llqi
    'LLQ' = 0
    other = [best.]
  ;
run;

data foo;
  input dat :llqi.;
  datalines;
1
LLQ
2
3
LLQ
;

* define the loglikelihood and find its maximum ;

proc nlmixed data=foo;
  parms mu=1 sigma=1;
  bounds sigma > 0; * keep the standard deviation positive ;
  * observed values contribute the log density, values below the LLQ the log probability of falling below it ;
  ll = log(pdf('normal', dat, mu, sigma)) * (dat > 0)
     + log(cdf('normal', &llq, mu, sigma)) * (dat = 0);
  model dat ~ general(ll);
run;

So the original purpose of NLMIXED is to fit nonlinear mixed models, but the optimization routines that power NLMIXED are very similar to the ones that power NLP, and the developers exposed those routines to the user through the general() modifier in the MODEL statement. So we define the loglikelihood data point by data point (using the density φ at the data value if it was observed, or Φ(LLQ;μ,σ) if it is below the LLQ) and then state that our data are distributed according to this loglikelihood.

To do this in R, the optimization is a little more direct:

dat <- c(1, NA, 2, 3, NA)  # NA marks values below the LLQ

ll <- function(x, dat, llq) {
  mu    <- x[1]
  sigma <- x[2]
  # observed values contribute the log density; each censored value
  # contributes log P(X < llq)
  sum(dnorm(dat[!is.na(dat)], mean = mu, sd = sigma, log = TRUE)) +
    sum(is.na(dat)) * log(pnorm(llq, mean = mu, sd = sigma))
}

optim(c(1, 1), ll, control = list(fnscale = -1), dat = dat, llq = 1)

So we define the data, then define the loglikelihood from more of a bird's-eye point of view: again, the log density is evaluated at the real data values, and the number of below-LLQ values is multiplied by the log probability of falling below the LLQ.

I checked, and they give the same answer. To get standard errors you need the Hessian of the loglikelihood; NLMIXED reports standard errors by default, and optim can return the Hessian on request. This should get you started.
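
For example, here is a minimal sketch in R building on the ll function above (solve(-H) is the usual observed-information estimate of the variance-covariance matrix; this snippet is my illustration, not part of the original programs):

fit <- optim(c(1, 1), ll, control = list(fnscale = -1), dat = dat, llq = 1)
H   <- optimHess(fit$par, ll, dat = dat, llq = 1)  # Hessian of the loglikelihood at the maximum
se  <- sqrt(diag(solve(-H)))                       # standard errors for mu and sigma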

Another caveat: if your sample size is small and your distribution isn't normal, you will have to work a bit harder and use the correct distribution in the likelihood.
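
For instance, if the assay values were better described as lognormal, the same R approach works with the normal functions swapped for their lognormal counterparts (a sketch under that assumption; ll_lnorm and the starting values are illustrative):

ll_lnorm <- function(x, dat, llq) {
  mu    <- x[1]  # meanlog
  sigma <- x[2]  # sdlog
  sum(dlnorm(dat[!is.na(dat)], meanlog = mu, sdlog = sigma, log = TRUE)) +
    sum(is.na(dat)) * plnorm(llq, meanlog = mu, sdlog = sigma, log.p = TRUE)
}

optim(c(0, 1), ll_lnorm, control = list(fnscale = -1), dat = dat, llq = 1)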

Tuesday, January 25, 2011

Tables to graphs

Here is an interesting site on how to convert tables that make your eyes bleed into more pleasant and interpretable graphs. It seems to have more of a political science focus (and links extensively to books by Andrew Gelman), but I think the principles apply to almost any area.

Sunday, January 23, 2011

Status of postmarketing commitments by product approval year


Because it didn't publish correctly, here's the key:

  • P-Pending
  • D-Delayed
  • F-Fulfilled
  • O-Ongoing
  • R-Released (i.e., the FDA no longer requires it)
  • S-Submitted
  • T-Terminated

Friday, January 21, 2011

List of statistics blogs

Here is a list of 40 statistics blogs. I read many of them regularly, but there are a few that are new to me.

Ultraedit to R

My favorite text editor on Windows is Ultraedit, but it does not have a nice interface to R in the same vein as Emacs/ESS, Tinn-R, or Eclipse. (I have never used Eclipse.) Ultraedit is powerful enough to submit whole R programs and even capture the output, but you cannot submit pieces and parts of programs.

Until now. After one night of dusting off my C skills and learning a bit about how to use winsock, I have created a little command line tool that will send code to R and get back the output. It's very rough around the edges, but it works.

You may also be interested in this forum thread.

Please read the README before using, as it will give you some minimal instructions on setup. Notably, the svSocket package is required (it's part of Sciviews, linked above, and available on CRAN).
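
For orientation, the tool assumes an svSocket server is listening inside your R session; here is a minimal sketch of starting one (the port number is just an assumption on my part; follow the README for the exact setup the tool expects):

library(svSocket)               # available on CRAN
startSocketServer(port = 8888)  # R now accepts code over this socket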

If you use Ultraedit and R, I think you will find it helpful even with the rough edges. Feedback is appreciated, either here or the linked thread.