Monday, March 5, 2012

Why I hate p-values (statistical leadership, Part II)

One statistical tool in particular is ubiquitous: the p-value. If it’s less than 0.05, your hypothesis must be true, right? Think again.

Ok, so I don’t hate p-values, but I do hate the way we abuse them. And here’s where we need statistical leadership: someone has to step back and critique these p-values before we all get too excited.

P-values can make or break venture capital deals, product approval for drugs, or senior management sign-off on a new deck lid design. We place a little too much trust in them. Here’s how we abuse them:

  • The magical 0.05: if we get 0.051, we lose, and if we get 0.049, we win! Never mind that the same experiment run under the same conditions can easily produce both of these results, as the first sketch after this list shows. (The difference between statistically significant and not significant is not itself statistically significant.)
  • The misinterpretation: the p-value is not the probability that the null hypothesis is true. It is the long-run relative frequency with which similar experiments, run under the same conditions, would produce a test statistic at least as extreme as the one you observed, if the null hypothesis were true. Got that? Well, no matter how small your p-value is, I can take a wimpy version of your treatment and get a smaller p-value, just by increasing the sample size to whatever I need (the second sketch below shows this). P-values depend on effect size, effect variance, and sample size.
  • The gaming of the p-value: in clinical trials it’s possible to make your p-value smaller by restricting your subject entry criteria to whatever brings out the treatment effect the most. This is not usually a problem, as long as you keep in mind that the rarefied world of early-phase clinical studies is different from the real world.
  • The unethical gaming of the p-value: this comes from retrospectively tweaking your subject population. I guess that’s ok if you present it not as a real result but as information for designing the next study; you can’t expect any scientific validity from tweaking a study, its population, or its analysis after the results are in.
  • Covariate madness: covariates tend to decrease the p-value by soaking up variation in the response, which shrinks the standard error of the treatment effect. That’s great if you want to identify segments of your patient population. But if you let the data pick the covariates and then report the p-value from the final model, that p-value is biased, as the third sketch below demonstrates.
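
To make the first point concrete, here is a minimal simulation sketch (mine, not from any real study; the half-standard-deviation effect, 30 subjects per arm, and the seed are all illustrative assumptions). It reruns the identical two-arm experiment 10,000 times and shows how widely the p-value scatters:

```python
# Rerun the same two-arm experiment many times; the true effect never changes,
# but the p-value bounces around, landing on both sides of 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, effect = 30, 0.5  # 30 subjects per arm, true effect of 0.5 SD (~50% power)

pvals = np.array([
    stats.ttest_ind(rng.normal(effect, 1.0, n),      # treated arm
                    rng.normal(0.0, 1.0, n)).pvalue  # control arm
    for _ in range(10_000)
])

print(f"p < 0.05 in {np.mean(pvals < 0.05):.0%} of replicates")
print(f"10th-90th percentile of p: "
      f"{np.percentile(pvals, 10):.3g} to {np.percentile(pvals, 90):.3g}")
```

Roughly half the replicates clear 0.05 and half don’t, so 0.049 versus 0.051 tells you more about sampling noise than about the treatment.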
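The sample-size point is just as easy to demonstrate. In this sketch (again, all the numbers are illustrative assumptions), the treatment moves the outcome by a trivial tenth of a standard deviation, yet the p-value can be driven as low as you like by piling on subjects:

```python
# A "wimpy" treatment (0.1 SD) tested at ever larger sample sizes.
# Any single random draw can wiggle, but the trend is unmistakable:
# enough subjects make even a trivial effect "highly significant."
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect = 0.1  # tiny but nonzero true effect

for n in (50, 500, 5_000, 50_000):  # subjects per arm
    treated = rng.normal(effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(treated, control).pvalue
    print(f"n = {n:>6} per arm: p = {p:.2g}")
```

A small p-value tells you the effect is probably not zero; it does not tell you the effect is big enough to care about.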
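Finally, a sketch of covariate madness. This is a hypothetical setup I made up for illustration (statsmodels OLS, pure-noise covariates, and a truly null treatment effect): fit one model per candidate covariate, report whichever gives the smallest treatment p-value, and the false-positive rate creeps above the nominal 5%:

```python
# Under a true null, data-driven covariate selection biases the reported p-value.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_sims, n, k = 2_000, 50, 30  # simulated trials, subjects, candidate covariates
naive_hits = selected_hits = 0

for _ in range(n_sims):
    treat = rng.integers(0, 2, n).astype(float)  # randomized 0/1 treatment
    covs = rng.normal(size=(n, k))               # candidate covariates: pure noise
    y = rng.normal(size=n)                       # outcome: treatment does nothing

    def treat_p(extra=None):
        cols = [treat] if extra is None else [treat, extra]
        X = sm.add_constant(np.column_stack(cols))
        return sm.OLS(y, X).fit().pvalues[1]     # p-value on the treatment term

    p_prespecified = treat_p()                   # the honest, prespecified model
    p_selected = min([p_prespecified] + [treat_p(covs[:, j]) for j in range(k)])

    naive_hits += p_prespecified < 0.05
    selected_hits += p_selected < 0.05

print(f"prespecified model rejects:    {naive_hits / n_sims:.1%}  (should be ~5%)")
print(f"best-of-{k + 1} models rejects:  {selected_hits / n_sims:.1%}")
```

The inflation here is modest because each model adds only a single noise covariate; let the selection roam over covariate subsets, interactions, or stepwise procedures and it gets much worse.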

Statisticians need to stay on top of these issues and advocate for the proper interpretation of p-values. Don’t leave it up to someone with an incomplete understanding of these tools.