Friday, November 20, 2009

Working on a drug safety project

In order to move some of my personal interests along, I have been trying to implement the methodology found in Berry and Berry's article "Accounting for Multiplicities in Assessing Drug Safety." This methodology uses the MedDRA hierarchy to improve the power of detecting damage to a particular organ. (The drawback, of course, is that MedDRA system organ classes do not perfectly correspond to the biology.) Apparently some other groups have already done so, but their implementations are hiding in paid software or on people's local drives. In any case, doing my own implementation in R is a good learning experience.

I've been working on this project for some time now, off and on. I've been making progress, and I'll share the results when I'm done. I'd also like to implement some of the other similar algorithms in this area, including the Poisson model that accounts for multiple occurrences of an adverse event, and a recent methodology that looks for "syndromes" (i.e., occurrences of groups of specific events, all of which arise within a short time) and "constellations" (where the time restrictions are relaxed).

Thursday, October 15, 2009

Free copy of Elements of Statistical Learning

The Stanford school of data mining has a free PDF of The Elements of Statistical Learning at the book's website. It's an excellent book, describing (and going a long way toward unifying) everything from regression to smoothing, tree-based methods, clustering, and so forth.

Wednesday, September 23, 2009

Beginning with the end in mind when collecting data: having data vs. using data

One of Stephen Covey's famous phrases is to begin with the end in mind. In my own work I have found that this adage is often ignored when data collection and databases are planned for clinical trials. Data collection is planned in order to have data, rather than to use it.

I'll give a recent example. I was asked to calculate the number of days a person was supposed to take a drug. We had the start date and end date, and so it should have been easy to do end - start + 1. However, to complicate matters, we were asked to consider days when the investigator told the subject to lay off the drug. This data was collected as free text. So, for example, the data could show up as follows:

  • 3 days beginning 9/22/2009
  • 9/22/2009-9/24/2009
  • 9/22-24/2009
  • Sept 22, 2009 - Sept 24, 2009

You get the picture. Because we are a group with finite time and not Google with 20% anything time, we were unable to parse all the different ways this data was recorded.

But we should not have had to. One person reviewing the data collection with the knowledge that this data would have to be analyzed would have immediately and strongly recommended that the data be collected in a structured format for the statistician to analyze at the end of the trial.
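To make the point concrete, here is a minimal sketch of how simple the calculation becomes when the interruptions are collected as structured dates. The data frames, column names, and the helper functions `n_days` and `planned_days` are all hypothetical, made up for illustration; they are not from the actual trial database.

```r
# Hypothetical structured records (names and values are illustrative only).
dosing <- data.frame(
  subject = "001",
  start   = as.Date("2009-09-01"),
  end     = as.Date("2009-09-30"),
  stringsAsFactors = FALSE
)
interruptions <- data.frame(
  subject    = "001",
  hold_start = as.Date("2009-09-22"),
  hold_end   = as.Date("2009-09-24"),
  stringsAsFactors = FALSE
)

# Inclusive day count between two dates: end - start + 1.
n_days <- function(first, last) as.integer(last - first) + 1L

# Days a subject was supposed to take the drug:
# total dosing days minus the days the investigator put the drug on hold.
planned_days <- function(dosing, interruptions) {
  held <- tapply(n_days(interruptions$hold_start, interruptions$hold_end),
                 interruptions$subject, sum)
  held_by_subject <- held[as.character(dosing$subject)]
  held_by_subject[is.na(held_by_subject)] <- 0L  # subjects with no holds
  n_days(dosing$start, dosing$end) - as.integer(held_by_subject)
}

print(planned_days(dosing, interruptions))
# [1] 27
```

With two date fields per interruption, the whole derivation is a subtraction. None of the free-text variants above would need to be parsed.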

It is with great interest that I note that this problem is much wider. This blog post suggests a possible reason: problems of the past had to do with hidden information or data, but modern problems have to do with problems hidden within data that is in plain sight (a hypothesis of Malcolm Gladwell and probably many others). That is, in the past, having the data was good enough. We did not have space to store huge amounts of data, and certainly not the processing power to sift through all of it. Now, we have the storage and the processing power, but our paradigm of thinking about data has not kept up. We are still thinking that all we need is to have it, when what we really need is to analyze it, discard what's irrelevant, and correctly interpret what is there.

And that's why Hal Varian regards statistics as the "sexy job" of the next decade.

Wednesday, August 26, 2009

The Bayesian information criterion (BIC) doesn't make sense to me

For fitting a model, several different criteria can be used. The first is the Akaike information criterion (AIC), which is basically -2*loglikelihood + 2*# parameters. So if you add a parameter to the model, it adds a penalty of 2 to the AIC, and you would need a commensurate decrease in -2*loglikelihood (so basically the loglikelihood would have to increase by at least 1) to make it worth adding.

The BIC penalizes the -2*loglikelihood by # parameters*log(sample size). So for large sample sizes, to add a parameter to the model you would have to improve the loglikelihood by a lot more. But doesn't a richer sample allow you to explore more parameters for the model? So in the case where you are able to explore more parameters, the BIC forces you to use fewer. Doesn't make a lot of sense to me, although the loglikelihood does take on a larger range of values with a larger sample as it sums over the sample.
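The two penalties are easy to compare numerically. The sketch below computes both criteria by hand from the formulas above and checks them against R's built-in `AIC()` and `BIC()`. The simulated data, the two models, and the helper `ic` are assumptions made up for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)  # by construction, x2 has no real effect

small <- lm(y ~ x1)
big   <- lm(y ~ x1 + x2)

# Compute both criteria directly from the log-likelihood.
ic <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")  # estimated parameters (for lm, includes the error variance)
  n  <- nobs(fit)
  c(AIC = -2 * as.numeric(ll) + 2 * k,
    BIC = -2 * as.numeric(ll) + log(n) * k)
}

print(rbind(small = ic(small), big = ic(big)))

# The hand-computed values agree with the built-in functions.
stopifnot(isTRUE(all.equal(ic(big)[["AIC"]], AIC(big))))
stopifnot(isTRUE(all.equal(ic(big)[["BIC"]], BIC(big))))

# Penalty per added parameter: 2 for AIC, log(n) for BIC.
print(c(AIC = 2, BIC = log(n)))
```

Since log(n) exceeds 2 as soon as n is 8 or more, the BIC charges more per parameter than the AIC for nearly any realistic sample, and the gap keeps growing with n.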

Thursday, July 30, 2009

Meetings and makers

Paul Graham has a very interesting article on the schedules of the manager and the maker. As someone expected to perform both functions at the same time, I find the tension between the two rather intense. I often end up asking whether I will ever be able to do any of the work I have to meet about.

Bonus: Stephen Dubner's take. He's in a similar position, only serially.

Saturday, July 11, 2009

Causal inference and biostatistics

I've been following the discussion on causal inference over at Gelman's blog with quite a bit of interest. Of course, this is in response to Judea Pearl's latest book on causal inference, which differs quite a bit from the theory that has been put forward by Donald Rubin and his colleagues over the last 35 years or so.

This is a theory that I think deserves more attention in biostatistics. After all, it goes back to the root of why we are studying drugs. Ultimately, we really don't give a damn about whether outcomes are better in the treated group than in the placebo group. Rather, we are more interested in whether we are able to benefit individuals by giving them a treatment. In other words, we are interested in the unknowable quantity of what each person's outcome would be if they were treated and what it would be if they were not. If there's an improvement and it outweighs the (unknowable) risks, the drug is worthwhile. The reason we are interested in outcomes of a treated group and outcomes of a placebo group is that the comparison is a surrogate for this unknowable quantity, especially if you use the intention-to-treat principle. However, as mentioned in the linked article and the research by Rubin, the intention-to-treat principle fails to deliver on its promise despite its simplicity and popularity.
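A small simulation can illustrate the distinction between the group comparison and the individual-level quantity. This is only a sketch under made-up assumptions: the effect sizes, the distributions, and every variable name are invented, and a simulation is the only setting in which both potential outcomes for a person are visible.

```r
set.seed(42)
n <- 1000

# Both potential outcomes for every subject (never jointly observable in a real trial).
y0     <- rnorm(n, mean = 10, sd = 2)    # outcome if untreated
effect <- rnorm(n, mean = 1, sd = 1.5)   # individual treatment effect, varies by person
y1     <- y0 + effect                    # outcome if treated

# Randomize, then reveal only the potential outcome that matches the assignment.
treated  <- rbinom(n, 1, 0.5) == 1
observed <- ifelse(treated, y1, y0)

true_average_effect <- mean(y1 - y0)
group_difference    <- mean(observed[treated]) - mean(observed[!treated])
share_harmed        <- mean(effect < 0)  # uses the unobservable individual effects

print(c(true_average_effect = true_average_effect,
        group_difference    = group_difference,
        share_harmed        = share_harmed))
```

Under randomization the group difference tracks the average effect, but the share of individuals who are worse off on treatment is computed here from `effect`, which no trial observes directly.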

Some clinical trials are now being run with causal inference as a central part of the design. Tools such as R and WinBUGS, along with Bayesian concepts, now make this logistically feasible. Further advances in the statistical handling of partial compliance with treatment, biological pathways of drugs, and the intention-to-treat principle itself make causal inference look more desirable by the day. It's really only inertia caused by the popularity and (apparent) simplicity of intention to treat that makes this concept slower to catch on.

Wednesday, July 1, 2009

PK/PD blogging

Well, it seems that for any topic, there is someone willing to blog about it. In this particular case, that is a very good thing. Via Derek Lowe's excellent blog, I found someone blogging on pharmacokinetics and pharmacodynamics (PK/PD). Drug development is very inefficient as it stands right now, and PK/PD modeling and simulation is one up-and-coming way to make it more efficient. I look forward to seeing what the author has to say in the upcoming weeks.

Sunday, June 21, 2009

Excellence in reporting statistics

Sharon Begley of Newsweek received the Excellence in Statistical Reporting Award - Statistical Modeling, Causal Inference, and Social Science


I have to say, I didn't even know the ASA (a professional organization to which I belong and in which I participate) gives out awards for reporting. With all the bad reporting involving statistics, it's refreshing to see someone making a real effort to help the public make sense of everything. Way to go, Sharon!

Thursday, June 18, 2009

Test post from Inference for R

I am testing out the Inference for R blogging tool.

a <- 3
b <- 3
c <- a + b
print(c)
[1] 6

Pretty neat.

Sunday, June 14, 2009

Graphing the many dimensions of gay rights

Gay Rights are Popular in Many Dimensions - Statistical Modeling, Causal Inference, and Social Science


In addition to having an interesting message, the graph in the article is the best executed I've seen in a long time. The data-to-ink ratio is extremely high, and the sheer amount of data presented is astounding. Yet the graph is clear and easy to read.