Sunday, June 5, 2016

Little Debate: defining baseline

In an April 30, 2015 note in Nature (vol 520, p. 612), Jeffrey Leek and Roger Peng note that p-values get intense scrutiny, while all the decisions that lead up to the p-values get little debate. I wholeheartedly agree, and so I'm creating a Little Debate series to shine some light on these tiny decisions that may not get a lot of press. Yet these tiny decisions can have a big influence on statistical analysis. Because my focus here is mainly biostatistics, most of these ideas will be placed in the setting of clinical trials.

Defining baseline seems like an easy thing to do, and conceptually it is. Baseline is where you start before some intervention (e.g. treatment, or randomization to treatment or placebo). However, the details of the definition of baseline in a biostatistics setting can get tricky very quickly.

The missing baseline

Baseline is often defined as the value at a randomization or baseline visit, i.e. the last measurement before the beginning of some treatment or intervention. However, a lot of times things happen - a needle breaks, a machine stops working, or study staff just forget to do procedures or record times. (These are not just hypothetical cases ... these have all happened!) In these cases, we end up with a missing baseline. A missing baseline will make it impossible to determine the effect of an intervention for a given subject.

In this case, we have accepted that we can use previous values, such as those taken during the screening of a subject, as baseline values. This is probably the best we can do under the circumstances. However, I'm unaware of any research on what effect this has on statistical analysis.

To make matters worse, a lot of times people without statistical training or expertise will make these decisions, such as putting down a post-dose value as baseline. Even with good documentation, these sorts of mistakes are not easy to find, and, when they are, they are often found near the end of the study, right when data management and statisticians are trying to produce results, and sometimes after interim analyses.

The average baseline

Some protocols specify that baseline consists of the average of three repeated measurements. Again, this decision is often made before any statisticians are consulted. The issue with such a statistical analysis is that averages are not easily comparable to raw values. Let's say that a baseline QTc (a measure of how fast the heart charge recovers from a pump, corrected for heart rate) is defined based on 3 electrocardiogram (ECG) measurements. The standard deviation of a raw QTc measurement (i.e. based on one ECG), let's say, is s. The standard deviation of the average of those three (assuming independence) is s/√3, or just above half the standard deviation of the raw ECG. Thus, a change of 1 unit in the average of 3 ECGs is a lot more noteworthy than a change of 1 unit in a single ECG measurement. And yet we compare that to single measurements for the rest of the study.

To make matters worse, if the ECG machine screws up one measurement, then the baseline becomes the average of two. A lot of times we lose that kind of information, and yet analyze the data as if the mystery average is a raw measurement.

The extreme baseline

In one observational study, the sponsor wanted to use the maximum value over the last 12 months as a baseline. This was problematic for several reasons. Like the average, the extreme baseline (here the maximum) is on a different scale, and even has a different distribution, than the raw measurement. The Fisher-Tippett (extreme value) theorem states that the maximum of n values converges to one of three extreme value distributions (Gumbel, Frechet, or Weibull). These distributions are then being compared to, again, single measurements taken after baseline. What's worse, any number of measurements could have been taken for those subjects within 12 months, leading to a major case of shifting sands regarding the distribution of baseline.

Comparing an extreme value with a later singular measurement will lead to an unavoidable case of regression to the mean, thus creating an apparent trend in the data where none may exist. Without proper context, this may lead to overly optimistic interpretations of the effect of an intervention, and overly small p-values. (Note that a Bayesian analysis is not immune to the misleading conclusions that might arise from this terrible definition of baseline.)


The definition of baseline is a "tiny decision" that can have major consequences in a statistical analysis. Yet, the impact of this decision has not been well studied, especially in the context of a clinical trial where a wide range of definitions may be written into a protocol without the expert advice of a statistician. Even a definition that has been well-accepted -- that baseline is the last singular pre-dose value before intervention -- has not been well-studied in the scenario of missing baseline day measurement. Other decisions are often made without considering the impact on analysis, including some that may lead to wrong interpretations.