Wednesday, August 8, 2012

The problem of multiple comparisons

John Cook's discussion at Wrong and unnecessary (The Endeavour) and the comments to that post are both worth reading (I rarely say that about comments to a post). The post is ostensibly about whether a linear model is useful even though it is not, strictly speaking, correct. In the comments, the idea of multiple comparisons is brought up, and not just whether adjustments are appropriate but to what extent the comparisons must be adjusted.

(For those not familiar with the problem, the Wikipedia article is worth reading. Essentially, multiple comparisons is a problem in statistics where you have a greater chance of falsely declaring statistical significance if you naively compare multiple endpoints, or compare the same endpoint over and over as you collect more data. Statisticians have many methods for adjusting for this effect, depending on the situation.)
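For readers who prefer to see the inflation rather than take it on faith, here is a minimal simulation sketch in Python (the endpoint count, sample size, and alpha are made up for illustration; numpy and scipy are assumed): with 20 independent endpoints and no true treatment effect on any of them, naive testing at 0.05 declares at least one "significant" result in roughly 64% of trials, while a Bonferroni threshold brings the familywise rate back to roughly 5%.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_endpoints, n, alpha = 5000, 20, 50, 0.05

naive_fwer = 0
bonf_fwer = 0
for _ in range(n_sims):
    # Two arms of 50 subjects, no true difference on any of the 20 endpoints.
    a = rng.normal(size=(n_endpoints, n))
    b = rng.normal(size=(n_endpoints, n))
    _, p = ttest_ind(a, b, axis=1)                # one t-test per endpoint
    naive_fwer += (p < alpha).any()               # any naive "significant" finding?
    bonf_fwer += (p < alpha / n_endpoints).any()  # Bonferroni-adjusted threshold

print(f"naive familywise error rate:      {naive_fwer / n_sims:.3f}")   # roughly 0.64
print(f"Bonferroni familywise error rate: {bonf_fwer / n_sims:.3f}")    # roughly 0.05
```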

In pivotal trials, multiple comparison adjustments are usually applied only to the primary endpoints, and sometimes a separate adjustment is applied to the secondary endpoints. (I find this practice bizarre, though I have heard it endorsed at least once by a reviewing team at the FDA.) Biostatisticians have complained that testing covariate imbalances at baseline (i.e. did people entering the treatment and placebo groups have the same distribution of ages?) adds to the multiple comparisons problem, even though these baseline tests do not directly lead to any declaration of significance. Bayesian methods do not necessarily "solve" the multiple comparisons problem; rather, if the model is set up appropriately (for example, a hierarchical model that shrinks endpoint estimates toward each other), they account for multiplicity when looking at the data multiple times or at multiple endpoints.
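For concreteness, here is a small sketch of how much the choice of family matters; the endpoint names and p-values are hypothetical, and statsmodels' multipletests is used for a Holm adjustment. It compares the common practice described above (adjusting the primary family and the secondary family separately) with treating every endpoint in the trial as a single family.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical two-sided p-values from a single trial.
primary   = {"overall survival": 0.012, "progression-free survival": 0.030}
secondary = {"response rate": 0.018, "quality of life": 0.041, "biomarker X": 0.009}

def holm(family):
    """Holm-adjusted p-values and rejections at overall alpha = 0.05 for one family."""
    names, pvals = zip(*family.items())
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    return {name: (round(adj, 3), bool(rej))
            for name, adj, rej in zip(names, adjusted, reject)}

# Common practice: adjust the primary family and the secondary family separately.
print("primary family only:  ", holm(primary))
print("secondary family only:", holm(secondary))

# Alternative: every endpoint in the trial is one family.
print("all endpoints jointly:", holm({**primary, **secondary}))
```

With these made-up numbers, every endpoint clears the 0.05 bar when each family is adjusted on its own, but only overall survival and biomarker X survive when all five endpoints are adjusted together, which is exactly why the convention of separate families deserves scrutiny.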

Multiple comparison methods tend to break down in situations with a large number of experiments or dimensions, such as the "A/B" experiments that web sites tend to run, or testing for the existence of adverse events of a drug, where sponsors tend to want to see a large number of Fisher's exact tests. (In fact, the second situation suffers from a more fundamental problem: committing a Type I error - declaring the existence of an adverse event where there is none - is the conservative mistake, while committing a Type II error - missing a real adverse event - is the dangerous one. Multiple comparison adjustments, which trade Type I errors for Type II errors, can therefore lead to erroneous assurances of safety, while not adjusting mostly costs some additional research confirming or refuting the apparently significant Fisher's tests.)
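To make the safety point concrete, here is a hedged sketch; the adverse-event rates, the sample sizes, and the 40 hypothetical AE tables are all invented, and only the one genuinely elevated adverse event is simulated. With these made-up numbers, the real signal is flagged far more often by the unadjusted Fisher's exact test than after a Bonferroni correction over the 40 tables, which is the erroneous assurance of safety mentioned above.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n_sims, n_per_arm, n_ae_types, alpha = 1000, 200, 40, 0.05
rate_pbo, rate_trt = 0.05, 0.12   # this adverse event really is more frequent on treatment

flagged_naive = flagged_bonf = 0
for _ in range(n_sims):
    trt_events = rng.binomial(n_per_arm, rate_trt)
    pbo_events = rng.binomial(n_per_arm, rate_pbo)
    table = [[trt_events, n_per_arm - trt_events],
             [pbo_events, n_per_arm - pbo_events]]
    _, p = fisher_exact(table)
    flagged_naive += p < alpha               # unadjusted Fisher's exact test
    flagged_bonf += p < alpha / n_ae_types   # Bonferroni over 40 AE tables

print(f"real AE flagged, unadjusted:      {flagged_naive / n_sims:.2f}")
print(f"real AE flagged after Bonferroni: {flagged_bonf / n_sims:.2f}")
```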

I have even heard the bizarre assertion that a multiple comparison adjustment is required across the several tests on one biological process, but that a separate adjustment applies to the tests on another process, even within the same study.

I assert that while we have a lot of clever methods for multiple comparisons, we have bizarre and arbitrary rules for when to apply them. Multiple comparison methodologies (simple, complex, or modern) control the Type I error rate, or a similar measure such as the false discovery rate, only over the family of tests to which they are applied, so the choice of that family - the coverage of the adjustment - needs to be justified scientifically, not by arbitrary rules.
