Thursday, January 27, 2011

Coping with lower-truncated assay values in SAS and R

 

Many assays, for example ones designed to detect a certain antibody, cannot detect values below a certain level (called the “lower limit of quantification,” or LLQ). Summarizing such data is problematic: if you simply drop the below-LLQ values, you estimate the mean too high and the standard deviation too low, but if you substitute 0 for them, you underestimate the mean and overestimate the standard deviation.
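
As a quick illustration of both biases, here is a minimal R sketch on simulated data (the true mean of 2, standard deviation of 1, and LLQ of 1 are made-up values for the example):

set.seed(1)
x <- rnorm(10000, mean = 2, sd = 1)   # simulated "true" assay values
llq <- 1
mean(x[x >= llq]); sd(x[x >= llq])    # dropping below-LLQ values: mean too high, sd too low
x0 <- ifelse(x < llq, 0, x)
mean(x0); sd(x0)                      # substituting 0: mean too low, sd too high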

One solution is maximum likelihood. It is fairly straightforward to write down the likelihood function:

L(x|μ,σ,LLQ) = [ ∏i: xi ≥ LLQ φ(xi;μ,σ) ] × Φ(LLQ;μ,σ)^nLLQ

where φ is the normal probability density function, Φ is the normal cumulative distribution function, and nLLQ is the number of observations below LLQ. What this likelihood essentially says is: if a data point is observed at or above LLQ, it contributes the value of the density at that point; if it is below LLQ, it contributes the probability of falling below LLQ.
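
Taking logs gives the loglikelihood that both the SAS and R code below actually maximize:

ℓ(μ,σ) = Σi: xi ≥ LLQ log φ(xi;μ,σ) + nLLQ · log Φ(LLQ;μ,σ)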

I looked around for a way to do this in SAS, and I found three. The first is to use SAS/IML with one of its optimization routines (such as nlpdd()). The second is to use SAS/OR with PROC NLP. The third way, and my preferred way at the moment, is to use SAS/STAT with PROC NLMIXED (SAS/STAT is found at nearly every SAS installation, whereas SAS/IML and SAS/OR are not always licensed).

To do this in SAS:

%let llq=1; * this is the lower detectable level ;

* read in the data in a useful way. I tried using the special missing value .L for LLQ but NLMIXED below didn’t like it. Substitute 0 with your value of choice if this doesn’t work;

proc format;
  invalue llqi
    'LLQ' = 0
    _other_ = [best.]
  ;
run;

data foo;
  input dat :llqi.;
  datalines;
1
LLQ
2
3
LLQ
;

* define the loglikelihood and find its maximum ;

proc nlmixed data=foo;
  parms mu=1 sigma=1;
  * observed values contribute the log density, below-LLQ values (coded 0) contribute log Pr(X < LLQ);
  ll = log(pdf('normal', dat, mu, sigma)) * (dat > 0)
     + log(cdf('normal', &llq, mu, sigma)) * (dat = 0);
  model dat ~ general(ll);
run;

The original purpose of NLMIXED is to fit nonlinear mixed models, but the optimization routines that power NLMIXED are very similar to the ones that power PROC NLP, and the developers exposed those routines to the user through the general() distribution in the MODEL statement. So we define the loglikelihood data point by data point (contributing log φ(xi;μ,σ) if it is an observed data point and log Φ(LLQ;μ,σ) if it is below LLQ) and then state that our data are distributed according to this loglikelihood.

To do this in R, the optimization is a little more direct:

dat <- c(1, NA, 2, 3, NA)  # NA marks values below LLQ

ll <- function(x, dat, llq) {
  mu <- x[1]
  sigma <- x[2]
  # observed values contribute the log density; each below-LLQ value
  # contributes log Pr(X < LLQ)
  return(sum(dnorm(dat[!is.na(dat)], mean = mu, sd = sigma, log = TRUE)) +
    sum(is.na(dat)) * log(pnorm(llq, mean = mu, sd = sigma)))
}

optim(c(1, 1), ll, control = list(fnscale = -1), dat = dat, llq = 1)

So we define the data, then define the loglikelihood from more of a bird's-eye point of view: again, we use the log density at the observed data values, and we count the number of below-LLQ values and multiply that count by the log probability of falling below LLQ.

I checked, and the two approaches give the same answer. To get standard errors, compute the Hessian at the maximum: NLMIXED reports standard errors by default, and optim will return the Hessian if you ask for it. But this should get you started.
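
For example, in R (a minimal sketch; with fnscale = -1 the Hessian that optim returns should be that of the negative loglikelihood, so inverting it gives an approximate covariance matrix):

fit <- optim(c(1, 1), ll, control = list(fnscale = -1), dat = dat, llq = 1, hessian = TRUE)
se <- sqrt(diag(solve(fit$hessian)))  # approximate standard errors for mu and sigma
cbind(estimate = fit$par, se = se)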

Another caveat: if your sample size is small and your data aren't approximately normal, you will have to work a bit harder and plug the correct distribution into the likelihood.
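
For instance, if a lognormal model were more plausible for your assay (purely an assumption for illustration), the same R loglikelihood could be rewritten with dlnorm() and plnorm():

ll_lnorm <- function(x, dat, llq) {
  mulog <- x[1]
  sdlog <- x[2]
  sum(dlnorm(dat[!is.na(dat)], meanlog = mulog, sdlog = sdlog, log = TRUE)) +
    sum(is.na(dat)) * log(plnorm(llq, meanlog = mulog, sdlog = sdlog))
}

optim(c(0, 1), ll_lnorm, control = list(fnscale = -1), dat = dat, llq = 1)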