In order to move some of my personal interests along, I have been trying to implement the methodology found in Berry and Berry's article "Accounting for Multiplicities in Assessing Drug Safety." This methodology uses the MedDRA hierarchy to improve the power of detecting damage to a particular organ. (The drawback, of course, is that MedDRA system organ classes do not perfectly correspond to the biology.) Apparently some other groups have already done so, but these implementations are hiding in paid software or on people's local drives, and doing my own implementation in R is a good learning experience.
I've been working on this project for some time now, off and on. Well, I've been making progress, and I'll share the results when I'm done. I'd also like to implement some of the other similar algorithms in this area, including the Poisson model that accounts for multiple occurrences of an adverse event, and a recent methodology that looks for "syndromes" (i.e., occurrences of groups of specific events, all of which arise within a short time) and "constellations" (where the time restrictions are relaxed).
Friday, November 20, 2009
Working on a drug safety project
Posted by Random John at 1:03 PM
Labels: Bayesian statistics, MedDRA, R, safety
Thursday, October 15, 2009
Free copy of Elements of Statistical Learning
The Stanford school of data mining has a free pdf of The Elements of Statistical Learning at the book's website. It's an excellent book, describing (and going a long way toward unifying) everything from regression to smoothing, tree-based methods, clustering, and so forth.
Posted by Random John at 8:56 AM
Labels: books, data mining, free
Wednesday, September 23, 2009
Beginning with the end in mind when collecting data: having data vs. using data
One of Stephen Covey's famous phrases is to begin with the end in mind. In my own work I have found that in clinical trials this adage is ignored when planning data collection and databases. Data collection is planned to have data, rather than to use it.
I'll give a recent example. I was asked to calculate the number of days a person was supposed to take a drug. We had the start date and end date, and so it should have been easy to do end - start + 1. However, to complicate matters, we were asked to consider days when the investigator told the subject to lay off the drug. This data was collected as free text. So, for example, the data could show up as follows:
- 3 days beginning 9/22/2009
- 9/22/2009-9/24/2009
- 9/22-24/2009
- Sept 22, 2009 - Sept 24, 2009
But we should not have had to parse free text at all. Anyone reviewing the data collection with the knowledge that this data would have to be analyzed would have immediately and strongly recommended that it be collected in a structured format for the statistician to analyze at the end of the trial.
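To give a sense of what handling even these four variants involves, here is a minimal sketch in Python (the same logic could be written in R). The function names `parse_interruption` and `days_on_drug` are hypothetical, and the patterns cover only the four formats listed above; real free text would almost certainly contain more variants than this.

```python
import re
from datetime import date, timedelta

# Month lookup keyed on the first three letters, so "Sep", "Sept" and
# "September" all resolve to 9.
MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def parse_interruption(text):
    """Return (first_day_off, last_day_off) for one free-text entry."""
    text = text.strip()

    # "3 days beginning 9/22/2009"
    m = re.fullmatch(r"(\d+) days? beginning (\d{1,2})/(\d{1,2})/(\d{4})", text)
    if m:
        n, mo, d, y = map(int, m.groups())
        start = date(y, mo, d)
        return start, start + timedelta(days=n - 1)

    # "9/22/2009-9/24/2009"
    m = re.fullmatch(
        r"(\d{1,2})/(\d{1,2})/(\d{4})\s*-\s*(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if m:
        m1, d1, y1, m2, d2, y2 = map(int, m.groups())
        return date(y1, m1, d1), date(y2, m2, d2)

    # "9/22-24/2009"
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})-(\d{1,2})/(\d{4})", text)
    if m:
        mo, d1, d2, y = map(int, m.groups())
        return date(y, mo, d1), date(y, mo, d2)

    # "Sept 22, 2009 - Sept 24, 2009"
    m = re.fullmatch(
        r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})\s*-\s*"
        r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})", text)
    if m:
        mon1, d1, y1, mon2, d2, y2 = m.groups()
        return (date(int(y1), MONTHS[mon1[:3].lower()], int(d1)),
                date(int(y2), MONTHS[mon2[:3].lower()], int(d2)))

    raise ValueError(f"unrecognized interruption entry: {text!r}")


def days_on_drug(start, end, interruptions):
    """Planned dosing days: end - start + 1, minus the days the subject was told to stop."""
    total = (end - start).days + 1
    for entry in interruptions:
        first, last = parse_interruption(entry)
        total -= (last - first).days + 1
    return total
```

All four example entries resolve to the same interval, September 22 through 24, 2009. Had the form simply asked for a start date and a stop date in fixed fields, the whole parser would collapse to the subtraction in `days_on_drug`.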
It is with great interest that I note that this problem is much wider. This blog post suggests a possible reason: problems of the past had to do with hidden information or data, but modern problems have to do with problems hidden within data that is in plain sight (a hypothesis of Malcolm Gladwell and probably many others). That is, in the past, having the data was good enough. We did not have space to store huge amounts of data, and certainly not the processing power to sift through all of it. Now, we have the storage and the processing power, but our paradigm of thinking about data has not kept up. We are still thinking that all we need is to have it, when what we really need is to analyze it, discard what's irrelevant, and correctly interpret what is there.
And that's why Hal Varian regards statistics as the "sexy job" of the next decade.
Posted by Random John at 8:45 PM
Labels: data collection, data mining, Hal Varian, statistics
Wednesday, August 26, 2009
The Bayesian information criterion (BIC) doesn't make sense to me
For fitting a model, several different criteria can be used. The first is the Akaike information criterion (AIC), which is basically -2*loglikelihood + 2*# parameters. So if you add a parameter to the model, the penalty raises the AIC by 2, and you would need a commensurate decrease in -2*loglikelihood (so basically the loglikelihood would have to increase by at least 1) to make it worth adding.
The BIC penalizes the -2*loglikelihood by # parameters*log(sample size). So for large sample sizes, to add a parameter to the model you would have to improve the loglikelihood by a lot more. But, doesn't a richer sample allow you to explore more parameters for the model? So in the case where you are able to explore more parameters, the BIC forces you to use fewer. Doesn't make a lot of sense to me, although the loglikelihood does take on a larger range of values with a larger sample as it sums over the sample.
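The arithmetic in the two paragraphs above can be sketched in a few lines of Python. The function names here are my own and purely illustrative; the formulas are the standard definitions described in the text, with `loglik` the maximized loglikelihood, `k` the number of parameters, and `n` the sample size.

```python
import math


def aic(loglik, k):
    """AIC = -2 * loglikelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """BIC = -2 * loglikelihood + (number of parameters) * log(sample size)."""
    return -2.0 * loglik + k * math.log(n)


def loglik_gain_needed(n=None):
    """Loglikelihood increase at which one extra parameter breaks even.

    With n=None this is the AIC threshold (always 1); otherwise it is the
    BIC threshold, log(n) / 2, which grows with the sample size.
    """
    return 1.0 if n is None else 0.5 * math.log(n)


if __name__ == "__main__":
    for n in (7, 8, 100, 10000):
        print(n, round(loglik_gain_needed(n), 3))
```

Since log(n) exceeds 2 once n is 8 or more, the BIC asks for a bigger loglikelihood gain per parameter than the AIC for all but the smallest samples, and the gap keeps widening as n grows.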
Posted by Random John at 7:19 AM
Thursday, July 30, 2009
Meetings and makers
Paul Graham has a very interesting article on the schedules of the manager and maker. As someone expected to perform both functions at the same time I find the tension between the two rather intense. I often end up asking whether I will ever be able to do any of the work I have to meet about.
Bonus: Stephen Dubner's take. He's in a similar position, only serially.
Posted by Random John at 4:40 PM
Saturday, July 11, 2009
Causal inference and biostatistics
I've been following the discussion on causal inference over at Gelman's blog with quite a bit of interest. Of course, this is in response to Judea Pearl's latest book on causal inference, which differs quite a bit from the theory that has been put forward by Donald Rubin and his colleagues for the last 35 years or so.
This is a theory that I think deserves more attention in biostatistics. After all, it goes back to the root of why we are studying drugs. Ultimately, we really don't give a damn about whether outcomes are better in the treated group than in the placebo group. Rather, we are more interested in whether we are able to benefit individuals by giving them a treatment. In other words, we are interested in the unknowable quantity of what each person's outcome is if they are treated and what it is if they are not. If there's an improvement and it outweighs the (unknowable) risks, the drug is worthwhile. The reason we are interested in outcomes of a treated group and outcomes of a placebo group is that the comparison is a surrogate for this unknowable quantity, especially if you use the intention-to-treat principle. However, as mentioned in the linked article and the research by Rubin, the intention-to-treat principle fails to deliver on its promise despite its simplicity and popularity.
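In the potential-outcomes notation usually associated with Rubin, the argument above can be written compactly. The symbols are my own choice for illustration: Y_i(1) and Y_i(0) are person i's outcomes with and without treatment, and Z_i is the randomized assignment. The middle line assumes everyone takes what they are assigned.

```latex
\begin{align*}
\tau_i &= Y_i(1) - Y_i(0)
  && \text{individual causal effect (never observed directly)} \\
Y_i^{\mathrm{obs}} &= Z_i\, Y_i(1) + (1 - Z_i)\, Y_i(0)
  && \text{only one potential outcome is seen per person} \\
\mathrm{ITT} &= E\!\left[Y^{\mathrm{obs}} \mid Z = 1\right] - E\!\left[Y^{\mathrm{obs}} \mid Z = 0\right]
  && \text{comparison of groups as assigned}
\end{align*}
```

Under randomization and full compliance, the group comparison in the last line equals the average of the individual effects, E[τ_i]. With partial compliance the second line no longer holds, and the same comparison measures the effect of being assigned to treatment rather than of receiving it.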
Some clinical trials are now being run with causal inference as a central part of the design. Tools such as R and WinBUGS and Bayesian concepts now make this logistically feasible. Further advances in statistical handling of partial compliance with treatment, biological pathways of drugs, and the intention-to-treat principle itself make causal inference look more desirable by the day. It's really only inertia caused by the popularity and (apparent) simplicity of intention to treat that makes this concept slower to catch on.
Posted by Random John at 4:12 PM
Labels: Bayesian statistics, causal inference, intention to treat, R
Wednesday, July 1, 2009
PK/PD blogging
Well, it seems that for any topic, there is someone willing to blog about it. In this particular case, that is a very good thing. Via Derek Lowe's excellent blog, I found someone blogging on pharmacokinetics and pharmacodynamics (PK/PD). Drug development is very inefficient as it stands right now, and PK/PD modeling and simulation is one up-and-coming way to make it more efficient. I look forward to seeing what the author has to say in the upcoming weeks.
Posted by Random John at 8:46 PM
Labels: modeling, PD, PK, simulation
Sunday, June 21, 2009
Excellence in reporting statistics
Posted by Random John at 4:10 PM
Thursday, June 18, 2009
Test post from Inference for R
I am testing out the Inference for R blogging tool.
a <- 3
b <- 3
c <- a + b
print(c)
[1] 6
Pretty neat.
Posted by Random John at 11:38 PM
Sunday, June 14, 2009
Graphing the many dimensions of gay rights
Posted by Random John at 8:49 AM
