Realizations in Biostatistics: January 2011

Thursday, January 27, 2011

Coping with lower-truncated assay values in SAS and R

Many assays, for example ones designed to detect a certain antibody, cannot detect values below a certain level (called the “lower level of quantification,” or LLQ). Summarizing these is problematic, because if you summarize without these problematic values, you clearly estimate the mean too high and the standard deviation too low. However, if you substitute 0, you underestimate the mean and overestimate the standard deviation.

One solution is maximum likelihood. It is fairly straightforward to write down the loglikelihood function:

L(x|μ,σ,LLQ) = Π_i=1ⁿφ(x_i;μ,σ)*Φ(LLQ;μ,σ)^#(LLQ)

where φ() is the normal probability distribution function and Φ is the normal cumulative distribution function. What this likelihood essentially says is if the data point is real, use it (by contributing the value of the density function at that data value), but if it is below LLQ, we substitute in the probability of getting less than LLQ into the likelihood.

I looked around for a way to do this in SAS, and I found 3. The first is use SAS/IML with one of its optimization routines (such as nlpdd()). The second is to use SAS/OR with PROC NLP. The third way, and my preferred way at the moment, is to use SAS/STAT with PROC NLMIXED (SAS/STAT will be found at nearly every SAS installation, whereas IML and OR are not always).

To do this in SAS:

%let llq=1; * this is the lower detectable level ;

* read in the data in a useful way. I tried using the special missing value .L for LLQ but NLMIXED below didn’t like it. Substitute 0 with your value of choice if this doesn’t work;

proc FORMAT;
invalue llqi
'LLQ'=0
_OTHER_=[best.]
;
run;

DATA foo;
input dat :llqi.;
datalines;
1
LLQ
2
3
LLQ
;

* define the loglikelihood and find its maximum ;

proc NLMIXED data=foo;
parms mu=1 sigma=1;
ll=log(pdf('normal',dat,mu,sigma))*(dat>0)+log(cdf('normal',&llq,mu,sigma))*(dat=0);
model dat ~ general(ll);
run;

So the original purpose of NLMIXED is to fit nonlinear mixed models, but the optimization routines that power NLMIXED are very similar to the ones that power NLP. Second, the developers exposed those optimization routines to the user through the general() modifier in the model statement. So we define the loglikelihood data point by data point (so using φ(x_i;μ,σ) if it is a normal data point or Φ(LLQ;μ,σ) if it is below LLQ) and then basically state that our data is distributed according to this loglikelihood.

To do this in R, the optimization is a little more direct:

dat <- c(1,NA,2,3,NA)

ll <- function(x,dat,llq) {

mu <- x[1]
sigma<-x[2]

return(sum(dnorm(dat[!is.na(dat)],mean=mu,sd=sigma,log=TRUE))+sum(is.na(dat))*log(pnorm(llq,mean=mu,sd=sigma))

}

optim(c(1,1),ll,control=list(fnscale=-1),dat=dat,llq=1)

So, we define the data, define the loglikelihood from more of a bird’s eye point of view (again, using the pdf at the real data values, and counting up the number of below LLQ and multiplying by the probability of below LLQ).

I checked, and they give the same answer. To get standard errors, compute the Hessian, but this should get you started. Both NLMIXED and optim should give those.

Another caveat: if your sample size is small and your distribution isn’t normal, you will have to work a bit harder and get the correct distribution to use in the likelihood.

Tuesday, January 25, 2011

Tables to graphs

Here is an interesting site on how to convert tables that make your eyes bleed to more pleasant and interpretable graphs. It seems to have more of a political science focus (and links extensively to books by Andrew Gelman), and I think the principles apply to most any area.

Sunday, January 23, 2011

Status of postmarketing commitments by product approval year

Status of post-marketing commitments by year of product approval

Because it didn't publish correctly, here's the key:

P-Pending
D-Delayed
F-Fulfilled
O-Ongoing
R-Released (i.e. FDA no longer requires it
S-Submitted
T-Terminated

Friday, January 21, 2011

List of statistics blogs

Here is a list of 40 statistics blogs. I read many of them regularly, but there are a few that are new to me.

Ultraedit to R

My favorite text editor on Windows is Ultraedit, but it does not have a nice interface to R in the same vein as Emacs/ESS, Tinn-R, or Eclipse. (I have never used Eclipse.) Ultraedit is powerful enough to submit whole R programs and even capture the output, but you cannot submit pieces and parts of programs.

Until now. After one night of dusting off my C skills and learning a bit about how to use winsock, I have created a little command line tool that will send code to R and get back the output. It's very rough around the edges, but it works.

You may also be interested in this forum thread.

Please read the README before using, as it will give you some minimal instructions on setup. Notably, the svSocket package is required (it's part of Sciviews, linked above, and available on CRAN).

If you use Ultraedit and R, I think you will find it helpful even with the rough edges. Feedback is appreciated, either here or the linked thread.

Saturday, January 15, 2011

Dona eis Python, whap!

Well, I'm taking the plunge and learning Python. We'll see how this goes. Then I'll try NumPy (and SciPy, if it gets ported), and see of I can get R and Python talking.

Saturday, January 1, 2011

Top Content of 2010

According to Google Analytics, my top content of 2010 was

Content	Pageviews
O'Brien Fleming Designs In Practice	1311
My O'Brien Fleming Design Is Not Same As Your O'Brien-Fleming Design	792
Matrix Square Roots In R	473
Barnard's Exact Test: Test That Ought To Be Used More	212
Perverse Incentives In Clinical Trials	188
Bayesian Information Criterion	141
Review Of Jim Albert's Bayesian Computing in R	126
NNT Returns	107

Note that views of the front page are excluded, but as that content rotates it is hard to interpret that. I have put some easy feedback mechanisms for each post (including the rating scale), but these have not returned very much information.

None of this surprises me. It seems like O'Brien Fleming designs are the top content just about any time I check, and I think this is because adaptive trials are becoming more popular. The FDA's draft guidance on the subject was released, and I guess O'Brien-Fleming designs are the most popular type of adaptive trial at least in the confirmatory stages of drug development.

I'm still surprised that matrix square roots in R is next, because the Denman-Beavers algorithm is fairly inefficient unless you need the square root and its inverse. Some method based on eigenvalues or singular value decomposition is probably better, but I guess occasionally you run across the odd matrix that has a rank 2 or higher subspace in its eigendecomposition. I'm glad to see Barnard's test getting a little exposure if at least to challenged the flawed overapplication of Fisher's exact test. I'm surprised Perverse Incentives and NNT did not rank higher, as those topics I think are of wider interest, but there you go. The BIC is a fairly specialized topic, and while model selection is of wide interest the methods I used that involved the BIC are rather specialized.

As a note, I write the review of Jim Albert's book before the second edition came out. I have the second edition now, and I should probably get around to updating the review.

As any reader of this blog might be able to tell, my focus is probably going to change a bit as I get more comfortable in the post-marketing world of drug development. The issues change, the quality and kinds of data change, and so my day-to-day issues will change. We'll see how it unfolds.

Of course, I'm open to suggestions as well.

Happy New Year 2011

Thank you Bill Watterson!

This year, 2011, is a prime number, and the first prime numbered-year since 2003. It is also the sum of a prime number (11) of consecutive primes. More numerology here.

Last year was a year of changes, some of which were not easy but all of which have produced good results 2011 will probably be more about strengthening current initiatives (including the blogging one!) than about starting new things, but you never know. The only thing I can imagine along those lines would be doing some freelance consulting here and there.

It's worth noting that my main goals are process-oriented and so really are more intentions than resolutions, and that strategy has worked fairly well. (The main one that's results-oriented is going on a big vacation for my 10th anniversary.)

At any rate, I'm an analyst in an increasingly analytical world, so look for more analysis!