Sunday, December 11, 2016

I set up a new data analysis blog

Well, I tried to write a blog post using the RStudio R Markdown system, and utterly failed. Thus, I set up a system where I can write and publish directly from RStudio: a GitHub Pages blog, where I can easily write and publish posts involving data analysis.

Saturday, September 24, 2016

Windows 10 anniversary update includes a whole Linux layer - this is good news for data scientists

If you are on Windows 10, no doubt you have heard that Microsoft included the bash shell in its 2016 Windows 10 anniversary update. What you may not know is that this is much, much more than just the bash shell. It is a whole Linux layer that enables you to use Linux tools, and it does away with the need for a compatibility layer like Cygwin (which requires a special DLL). However, you will only get the bash shell out of the box. To enable the whole Linux layer, follow the instructions here. Basically, this involves enabling developer mode and then enabling the Linux layer feature. In the process, you will download some further software from the Windows Store.

Why is this big news? To me, this brings a lot of the Linux tools that have proven useful over the years, such as wc, sed, awk, grep, and so forth. For some tasks these tools work much better than software packages such as R or SAS, and their real power comes from combining them through pipes. You also get apt-get, which enables you to install and manage packages such as SQLite, Octave, and gnuplot. You can even install R through this method, though I don't know whether RStudio works with R installed in this way.
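
As a small taste of combining these tools through pipes, here is a sketch (the file and column are made up for illustration) that tallies the most common values in a CSV column using nothing but these classic utilities:

```shell
# build a tiny CSV, then count the most frequent values in column 2,
# skipping the header line
printf 'id,color\n1,red\n2,blue\n3,red\n4,red\n5,blue\n' > colors.csv
sed 1d colors.csv | awk -F, '{print $2}' | sort | uniq -c | sort -rn
# red appears 3 times and is listed first, blue 2 times
```

Each tool does one small job (strip the header, cut the column, group, count, rank), and the pipe chains them into an analysis that would otherwise require loading the file into a full statistics package.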

If you're a Linux buff who uses Windows, you can probably think of many more things you can do with this. The only drawback is that I haven't tried using any sort of X Windows or other graphical interface.

Sunday, June 26, 2016

Which countries have Regrexit?

This doesn't have a lot to do with the bio part of biostatistics, but it is an interesting data analysis that I just started. In the wake of the Brexit vote, there is a petition for a redo. The data for the petition are available here, in JSON format.

Fortunately, in R, working with JSON data is pretty easy. You can easily download the data from the link and put it into a data frame. I start on that here, with the RJSONIO package, ggplot2, and a version of the petition I downloaded on 6/26/16.
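
A sketch of that first step, reading the JSON into a data frame (the field names below are assumptions about the petition file's layout, so check them against the actual download):

```r
library(RJSONIO)

# read the 6/26/16 download of the petition data
pet <- fromJSON("petition.json")

# assumed layout: a signatures-by-country list, each entry with a country
# name and a signature_count
by_country <- pet$data$attributes$signatures_by_country
counts <- data.frame(
  country    = sapply(by_country, function(x) x$name),
  signatures = sapply(by_country, function(x) as.numeric(x$signature_count))
)

# top countries by number of signatures
head(counts[order(-counts$signatures), ], 10)
```

From there, a data frame like `counts` feeds straight into ggplot2 for the bar chart below.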

One question I had was whether all the signers are British. Fortunately, the petition collects each signer's place of residence (assuming no fraud). I came up with the following top 9 non-UK countries of origin of signers.

There are a couple of things to remember when interpreting this graph:
  1. I left off the UK, which accounts for over 3 million signatures, by far the largest share of signatories.
  2. 5 of the top 9 countries are European neighbors of the UK, including the top 2. The other 4 are Australia, the US, Canada, and New Zealand, all countries with strong ties to the UK.
  3. This assumes no petition fraud, which I can't guarantee. I saw at least one Twitter posting telling people to use her (if the profile picture is to be believed) residence code. There is a section of the petition data showing constituency, so I'm wondering whether it would be possible to analyze the petition for fraud. I'm not as familiar with British census data as I am with US data, but I imagine a mashup of the two would be useful.
(Update: Greg Jefferis posted a nice analysis here. See comments below.)

Sunday, June 5, 2016

Little Debate: defining baseline

In an April 30, 2015 note in Nature (vol 520, p. 612), Jeffrey Leek and Roger Peng note that p-values get intense scrutiny, while all the decisions that lead up to the p-values get little debate. I wholeheartedly agree, and so I'm creating a Little Debate series to shine some light on these tiny decisions. They may not get a lot of press, yet they can have a big influence on a statistical analysis. Because my focus here is mainly biostatistics, most of these ideas will be placed in the setting of clinical trials.

Defining baseline seems like an easy thing to do, and conceptually it is. Baseline is where you start before some intervention (e.g. treatment, or randomization to treatment or placebo). However, the details of the definition of baseline in a biostatistics setting can get tricky very quickly.

The missing baseline

Baseline is often defined as the value at a randomization or baseline visit, i.e. the last measurement before the beginning of some treatment or intervention. However, a lot of times things happen - a needle breaks, a machine stops working, or study staff just forget to do procedures or record times. (These are not just hypothetical cases ... these have all happened!) In these cases, we end up with a missing baseline. A missing baseline will make it impossible to determine the effect of an intervention for a given subject.

In this case, we have accepted that we can use previous values, such as those taken during the screening of a subject, as baseline values. This is probably the best we can do under the circumstances. However, I'm unaware of any research on what effect this has on statistical analysis.

To make matters worse, these decisions are often made by people without statistical training or expertise, such as putting down a post-dose value as baseline. Even with good documentation, these sorts of mistakes are not easy to find, and when they are found, it is often near the end of the study, right when data management and statisticians are trying to produce results, and sometimes after interim analyses.

The average baseline

Some protocols specify that baseline consists of the average of three repeated measurements. Again, this decision is often made before any statisticians are consulted. The issue is that averages are not directly comparable to raw values. Let's say that a baseline QTc (a measure of how long the heart's electrical system takes to recharge after each contraction, corrected for heart rate) is defined from 3 electrocardiogram (ECG) measurements. Say the standard deviation of a raw QTc measurement (i.e. one based on a single ECG) is s. The standard deviation of the average of three such measurements (assuming independence) is s/√3, just over half the standard deviation of a single ECG. Thus, a change of 1 unit in the average of 3 ECGs is a lot more noteworthy than a change of 1 unit in a single ECG measurement. And yet we compare that average to single measurements for the rest of the study.
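
The √3 factor is easy to check by simulation; a quick sketch with arbitrary numbers:

```r
# standard deviation of a single measurement vs. the average of 3
# (mean and sd values here are arbitrary)
set.seed(1)
s <- 10
x <- matrix(rnorm(3 * 1e5, mean = 400, sd = s), ncol = 3)

sd(x[, 1])       # close to s = 10
sd(rowMeans(x))  # close to s / sqrt(3), about 5.77
```

So a change-from-baseline analysis that subtracts a triplicate average from later single measurements is mixing two noise scales without acknowledging it.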

To make matters worse, if the ECG machine screws up one measurement, then the baseline becomes the average of two. A lot of times we lose that kind of information, and yet we analyze the data as if this mystery average were a single raw measurement.

The extreme baseline

In one observational study, the sponsor wanted to use the maximum value over the last 12 months as a baseline. This was problematic for several reasons. Like the average, the extreme baseline (here the maximum) is on a different scale, and even has a different distribution, than the raw measurement. The Fisher-Tippett (extreme value) theorem states that the suitably normalized maximum of n values converges to one of three extreme value distributions (Gumbel, Fréchet, or Weibull). These distributions are then being compared to, again, single measurements taken after baseline. What's worse, any number of measurements could have been taken for each subject within those 12 months, leading to a major case of shifting sands regarding the distribution of baseline.

Comparing an extreme value with a later single measurement will lead to an unavoidable case of regression to the mean, thus creating an apparent trend in the data where none may exist. Without proper context, this may lead to overly optimistic interpretations of the effect of an intervention, and overly small p-values. (Note that a Bayesian analysis is not immune to the misleading conclusions that might arise from this terrible definition of baseline.)
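
A quick no-effect simulation shows the artifact: if baseline is the maximum of 12 monthly measurements and follow-up is a single value from the same distribution, the change from baseline looks like a substantial decrease even though nothing happened.

```r
# regression to the mean with an extreme (maximum) baseline, no true effect
set.seed(2)
n <- 1e4
baseline <- apply(matrix(rnorm(12 * n), ncol = 12), 1, max)  # max of 12 values
followup <- rnorm(n)  # single value from the same distribution

mean(followup - baseline)  # clearly negative: an apparent "improvement"
```

The mean change is driven entirely by the baseline definition, not by any intervention.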


The definition of baseline is a "tiny decision" that can have major consequences in a statistical analysis. Yet the impact of this decision has not been well studied, especially in the context of a clinical trial, where a wide range of definitions may be written into a protocol without the expert advice of a statistician. Even a definition that is well accepted -- that baseline is the last single pre-dose value before intervention -- has not been well studied in the scenario of a missing baseline measurement. Other decisions are often made without considering the impact on analysis, including some that may lead to wrong interpretations.

Friday, May 20, 2016

Simulating a Weibull conditional on the time-to-event being greater than a given time

Recently, I had to simulate times-to-event for subjects who were still on study at the time of a data cut and thus still at risk of an event (e.g. progressive disease, cardiac event, death). This requires simulating from a conditional Weibull distribution. To do this, I created the following function:

# simulate conditional Weibull, conditional on survival > t ---------------

# reliability function is exp{-((t+x)/b)^a} / exp{-(t/b)^a} = 1 - F(x)
# n = number of points to return
# shape = shape parameter of the Weibull
# scale = scale parameter of the Weibull (default 1)
# t = minimum survival time (default 0, which makes the function act like rweibull)
my_rweibull <- function(n, shape, scale = 1, t = 0) {
  if (length(t) != 1 && length(t) != n) {
    stop("length(t) is not 1 or n")
  }
  # inversion on the reliability function: if u ~ uniform(0,1), solve
  # exp{-(x/scale)^shape + (t/scale)^shape} = u for the total time x > t
  u <- runif(n)
  scale * ((t / scale)^shape - log(u))^(1 / shape)
}

You use this function just like rweibull, except that you also pass in a vector t of minimum times (or a scalar minimum applied to all simulated values). The idea is that the probability that the random variable exceeds T + t, given that it exceeds T, is exp{-((T+t)/b)^a} / exp{-(T/b)^a}, so you can simulate from this with a uniform random variate. I use the inversion method on the reliability function (just like the inversion method on the distribution function, with the insight that if U is uniform(0,1), then so is 1-U).
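
For example, drawing event times for subjects known to be event-free at time 100, with hypothetical shape and scale values:

```r
# five event times conditional on survival past t = 100
# (shape and scale values here are hypothetical)
set.seed(3)
x <- my_rweibull(5, shape = 1.5, scale = 200, t = 100)

all(x > 100)  # TRUE: every simulated time exceeds the conditioning time
```

Setting t = 0 reduces the reliability function to the ordinary Weibull survival function, so the draws should then match rweibull with the same shape and scale, which makes a handy sanity check.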

Truth be told, I ought to buckle down and learn how to do packages in R, but for now I'll just pass on some code on my blog if I think it will be helpful (or if I need to find it while doing a Google search later).

(Edit on 7/16: OOPS! A previous version of this had the scale and shape parameters switched. I've corrected it now. If you copied this before 7/16/2016, please check again.)

Thursday, January 14, 2016

Talk to Upstate Data Science Group on Caret

Last night I gave an introduction and demo of the caret R package to the Upstate Data Science group, meeting at Furman University. It was fairly well attended (around 20 people), and well received.

It was great to get out of my own comfort zone a bit (since graduate school, I've only really given talks on topics in biostatistics) and to meet statisticians, computer scientists, and other sorts of data scientists from many different fields. This is a relatively new group, and given the interest over the last couple of months, I think it has been sorely needed in the Upstate South Carolina region.

We'll be participating in Open Data day in March of this year, so if you are in the Upstate SC region, or don't mind making the trek from Columbia or Western NC, find us on Meetup. Our next meeting is a data hack night which promises to be interesting.

Wednesday, November 25, 2015

Even the tiniest error messages can indicate an invalid statistical analysis

The other day, I was reading in a data set in R, and the function indicated that there was a warning about a parsing error on one line. I went ahead with the analysis anyway, but that small parsing error kept bothering me. I thought it was just one line of goofed up data, or perhaps a quote in the wrong place. I finally opened up the CSV file in a text editor, and found that the reason for the parsing error was that the data set was duplicated within the CSV file. The parsing error resulted from the reading of the header twice. As a result, anything I did afterward was suspect.
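
If the file was read with readr (an assumption; the post doesn't name the package), the warning can be chased down instead of shrugged off:

```r
library(readr)

# hypothetical file that triggers a parsing warning on read
dat <- read_csv("mydata.csv")

# shows the row, column, and cause of each parsing problem -
# a duplicated header would show up here as a type mismatch
problems(dat)
```

A minute spent on that report would have revealed the duplicated data immediately.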

Word to the wise: track down the reasons for even the most innocuous-seeming warnings. Every stage of a statistical analysis is important, and small errors anywhere along the way can have huge consequences downstream. Perhaps this is obvious, but you still have to slow down and take care of the details.

(Note that I'm editing this to be a part of my Little Debate series, which discusses the tiny decisions dealing with data that are rarely discussed or scrutinized, but can have a major impact on conclusions.)

Thursday, October 22, 2015

Statisticians ruin the day again, this time with a retraction

Authors retract second study about medical uses of honey - Retraction Watch:

For the second time, these authors have had to retract a paper because of serious data analysis errors. While the details of the actual errors are scant, we do know that the article was published, a company tried and failed to replicate the results, the journal editor employed a third-party statistician who found serious errors in the data analysis, and the errors were serious enough that the paper, to stay accepted, would have had to go through a major revision and further peer review.

Better to get a statistician to ruin your day before publication (and require another revision) than to have to eat crow because a third party did it. Other thoughts:

  • Did the authors have a statistician review this before submitting?
  • Did the journal have this go through a statistical peer review before accepting?
  • Come to think of it, was a statistician involved in the planning of this study?

Thursday, October 15, 2015

The thirty-day trial

Steve Pavlina wrote about a self-help technique called the thirty-day trial. To perform the technique, you commit to 30 days of some new habit, such as quitting smoking or writing in a journal. The idea is that it's psychologically easier to commit to something for 30 days than to make a permanent change, but after the 30 days you have broken the attachment to old habits and have the perspective to decide whether to continue with the new habit, go back, or go in a completely different direction.

For activities like journaling or quitting smoking, this technique might work. After all, psychologist Jeremy Dean announced that it takes 21 days to form a new habit. (This claim has been treated with some due skepticism, as it should be.) However, if you try a 30-day quit-smoking trial and it doesn't work, try again. And if that doesn't work, try it again. Until you succeed.

However, for activities such as a new diet, a treatment, or any sort of healthcare advice, the 30-day trial should not be used.

For cases where one might try a 30-day trial just to see if it feels better, such as an elimination diet, confirmation bias is likely to be at play. This is especially true of an elimination diet, where one eliminates some component of the diet, like dairy or gluten, and sees whether symptoms such as bloating or fatigue go away. In such trials, the results may be due to the eliminated item in question, a placebo/nocebo effect, or some other item confounded with it. For instance, bloating from gluten-containing foods probably comes from short-chain carbohydrates called FODMAPs. Just because you feel better for 30 days after discontinuing gluten-containing foods doesn't mean that gluten is the culprit. In fact, it probably isn't, as determined by a well-designed clinical trial. Likewise, eliminating milk from a diet isn't likely to do much unless lactose intolerance is the culprit, and there are tests for that.

Returning to Pavlina's advice, he recommends a 30-day trial over the results of a clinical trial. This is sloppy and irresponsible advice. First, it is very unlikely that an individual will have access to the screening assays, animal models, or millions of dollars needed to discover and develop a drug. That is, unless said individual is up for trying millions of compounds, many of them potentially toxic, which would probably tragically shorten the trial. Instead, a pharmaceutical should be used under the supervision of a doctor, who should be aware of the totality of the literature (ideally randomized controlled trials and a systematic review) and can weigh the benefits against possible adverse effects. A 30-day trial of a pharmaceutical or medical device may be completely inappropriate for realizing benefits or assessing risks. This is the primary case where the plural of anecdote is not data.

The bottom line is that the 30-day trial is one of several ways (and perhaps not the best one) to change habits that you already know, from independent confirmation, need to change, like quitting smoking. It's a terrible way to adjust a diet or determine whether a course of treatment is the right one. Treatments should be based on science and supervised by a physician who can objectively determine whether a treatment is working.

Friday, May 8, 2015

Statistics: P values are just the tip of the iceberg : Nature News & Comment

This article is very important. Yes, p-values reported in the literature (or in your own research) need scrutiny, but so does every step in the analysis process, starting with the design of the study.