Sunday, December 11, 2016

I set up a new data analysis blog

Well, I tried to write a blog post here using the RStudio R Markdown system and utterly failed. Thus, I set up a system where I can write and publish directly from RStudio: a GitHub Pages blog at randomjohn.github.io. There I can easily write and publish posts involving data analysis.

Saturday, September 24, 2016

Windows 10 anniversary update includes a whole Linux layer - this is good news for data scientists

If you are on Windows 10, no doubt you have heard that Microsoft included the bash shell in its 2016 Windows 10 anniversary update. What you may not know is that this is much, much more than just the bash shell. This is a whole Linux layer that enables you to use Linux tools, and it does away with the need for a further compatibility layer like Cygwin (which requires a special DLL). However, you only get the bash shell out of the box. To enable the whole Linux layer, follow the instructions here. Basically, this involves enabling developer mode and then enabling the Linux layer feature. In the process, you will download some further software from the Windows Store.

Why is this big news? To me, this installs a lot of the Linux tools that have proven useful over the years, such as wc, sed, awk, grep, and so forth. In some cases these tools work much better than software packages such as R or SAS for a given task, and their real power comes from combining them through pipes. You also get apt-get, which enables you to install and manage packages such as SQLite, Octave, and gnuplot. You can even install R through this method, though I don't know whether RStudio works with R installed in this way.

If you're a Linux buff who uses Windows, you can probably think of many more things you can do with this. The only drawback is that I haven't yet tried any sort of X Windows or other graphical interface.

Sunday, June 26, 2016

Which countries have Regrexit?

This doesn't have a lot to do with the bio part of biostatistics, but it is an interesting data analysis that I just started. In the wake of the Brexit vote, there is a petition for a redo. The data for the petition are available here, in JSON format.

Fortunately, in R, working with JSON data is pretty easy. You can easily download the data from the link and put it into a data frame. I start on that here, with the RJSONIO package, ggplot2, and a version of the petition I downloaded on 6/26/16.
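As a rough sketch of that loading step (the file name below is a placeholder, and the field names data$attributes$signatures_by_country, name, and signature_count are what I recall from my downloaded copy, so treat them as assumptions to check against the live feed):

library(RJSONIO)

# parse the saved petition JSON (the file name is a placeholder for wherever you saved it)
json <- fromJSON("eu_referendum_petition.json")

# assumed structure: a list of country records, each with a country name and a signature count
by_country <- json$data$attributes$signatures_by_country
countries <- data.frame(
  name  = sapply(by_country, function(x) x[["name"]]),
  count = sapply(by_country, function(x) as.numeric(x[["signature_count"]])),
  stringsAsFactors = FALSE
)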

One question I had was whether all the signers are British. Fortunately, the petition records each signer's place of residence (assuming no fraud). I came up with the following top 9 non-UK countries of residence of signers.
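The plot itself comes from something like the following, continuing from the countries data frame sketched above (the "United Kingdom" label is my assumption about how the petition codes UK residents):

library(ggplot2)

# drop the UK, then keep the nine largest non-UK signature counts
non_uk <- subset(countries, name != "United Kingdom")
top9 <- head(non_uk[order(-non_uk$count), ], 9)

ggplot(top9, aes(x = reorder(name, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Country of residence", y = "Signatures")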

There are a couple of things to remember when interpreting this graph:
  1. I left off the UK, which accounts for over 3 million signatures, by far the largest share of signatories.
  2. 5 of the top 9 countries are neighbors of the UK, including the top 2. The other 4 are Australia, the US, Canada, and New Zealand, all countries with strong ties to the UK.
  3. This assumes no petition fraud, which I can't guarantee. I saw at least one Twitter post telling people to use her (if the profile pic is to be believed) residence code. There is a section of the petition data showing constituency, so I wonder whether it would be possible to analyze the petition for fraud. I'm not as familiar with British census data as I am with US data, but I imagine a mashup of the two would be useful.
(Update: Greg Jefferis posted a nice analysis here. See comments below.)

Sunday, June 5, 2016

Little Debate: defining baseline

In an April 30, 2015 note in Nature (vol. 520, p. 612), Jeffrey Leek and Roger Peng observe that p-values get intense scrutiny, while all the decisions that lead up to the p-values get little debate. I wholeheartedly agree, and so I'm creating a Little Debate series to shine some light on these tiny decisions that may not get a lot of press yet can have a big influence on a statistical analysis. Because my focus here is mainly biostatistics, most of these ideas will be placed in the setting of clinical trials.

Defining baseline seems like an easy thing to do, and conceptually it is. Baseline is where you start before some intervention (e.g. treatment, or randomization to treatment or placebo). However, the details of the definition of baseline in a biostatistics setting can get tricky very quickly.

The missing baseline

Baseline is often defined as the value at a randomization or baseline visit, i.e. the last measurement before the beginning of some treatment or intervention. However, a lot of times things happen - a needle breaks, a machine stops working, or study staff just forget to do procedures or record times. (These are not just hypothetical cases ... these have all happened!) In these cases, we end up with a missing baseline. A missing baseline will make it impossible to determine the effect of an intervention for a given subject.

In this case, we have accepted that we can use previous values, such as those taken during the screening of a subject, as baseline values. This is probably the best we can do under the circumstances. However, I'm unaware of any research on what effect this has on statistical analysis.

To make matters worse, a lot of times people without statistical training or expertise will make these decisions, such as putting down a post-dose value as baseline. Even with good documentation, these sorts of mistakes are not easy to find, and, when they are, they are often found near the end of the study, right when data management and statisticians are trying to produce results, and sometimes after interim analyses.

The average baseline

Some protocols specify that baseline consists of the average of three repeated measurements. Again, this decision is often made before any statisticians are consulted. The issue with such a definition is that averages are not easily comparable to raw values. Let's say that a baseline QTc (a heart-rate-corrected measure of how long the heart's electrical system takes to recover after each beat) is defined as the average of 3 electrocardiogram (ECG) measurements. The standard deviation of a raw QTc measurement (i.e. one based on a single ECG) is, let's say, s. The standard deviation of the average of three such measurements (assuming independence) is s/√3, or just above half the standard deviation of a single ECG. Thus, a change of 1 unit in the average of 3 ECGs is a lot more noteworthy than a change of 1 unit in a single ECG measurement. And yet we compare that averaged baseline to single measurements for the rest of the study.
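A quick simulation sketch, with made-up QTc numbers, shows the mismatch in variability:

# made-up numbers: single-ECG QTc with mean 420 ms and SD 20 ms
set.seed(1)
s <- 20
single <- rnorm(1e5, mean = 420, sd = s)
avg3 <- replicate(1e5, mean(rnorm(3, mean = 420, sd = s)))
sd(single)   # about 20
sd(avg3)     # about 20 / sqrt(3), roughly 11.5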

To make matters worse, if the ECG machine screws up one measurement, then the baseline becomes the average of two. A lot of times we lose that kind of information, and yet analyze the data as if the mystery average is a raw measurement.

The extreme baseline

In one observational study, the sponsor wanted to use the maximum value over the previous 12 months as the baseline. This was problematic for several reasons. Like the average, the extreme baseline (here the maximum) is on a different scale from the raw measurement, and even has a different distribution. The Fisher-Tippett (extreme value) theorem states that the maximum of n values, suitably normalized, converges in distribution to one of three extreme value families (Gumbel, Fréchet, or Weibull). These distributions are then being compared to, again, single measurements taken after baseline. What's worse, any number of measurements could have been taken for those subjects within the 12 months, leading to a major case of shifting sands regarding the distribution of baseline.

Comparing an extreme value with a later single measurement will lead to an unavoidable case of regression to the mean, thus creating an apparent trend in the data where none may exist. Without proper context, this may lead to overly optimistic interpretations of the effect of an intervention, and to overly small p-values. (Note that a Bayesian analysis is not immune to the misleading conclusions that might arise from this terrible definition of baseline.)
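To illustrate, here is a small simulation sketch (entirely made-up numbers) in which nothing truly changes between baseline and follow-up, yet defining baseline as the maximum of several prior measurements manufactures an apparent decline:

set.seed(2)
n <- 1000
true_level <- rnorm(n, mean = 100, sd = 10)   # each subject's stable underlying value
k <- rpois(n, 4) + 1                          # varying number of pre-baseline measurements
baseline <- sapply(seq_len(n), function(i) max(true_level[i] + rnorm(k[i], sd = 5)))
followup <- true_level + rnorm(n, sd = 5)     # a single post-baseline measurement
mean(followup - baseline)                     # noticeably negative despite no true change
t.test(followup - baseline)                   # an "effect" appears where none exists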

Conclusion

The definition of baseline is a "tiny decision" that can have major consequences for a statistical analysis. Yet the impact of this decision has not been well studied, especially in the context of a clinical trial, where a wide range of definitions may be written into a protocol without the expert advice of a statistician. Even a definition that is well accepted -- that baseline is the last single pre-dose value before intervention -- has not been well studied in the scenario of a missing baseline-day measurement. Other decisions are often made without considering the impact on analysis, including some that may lead to wrong interpretations.

Friday, May 20, 2016

Simulating a Weibull conditional on the time-to-event being greater than a given time

Recently, I had to simulate times to event for subjects who were still on study at the time of a data cut and therefore still at risk of an event (e.g. progressive disease, cardiac event, death). This requires simulating from a conditional Weibull distribution. To do this, I created the following function:




# simulate a Weibull conditional on survival beyond time t ---------------

# conditional reliability: P(X > x | X > t) = exp{-(x/b)^a} / exp{-(t/b)^a}
# n     = number of values to return
# shape = shape parameter (a) of the Weibull
# scale = scale parameter (b) of the Weibull (default 1)
# t     = minimum time(s); the default 0 makes the function act like rweibull
my_rweibull <- function(n, shape, scale = 1, t = 0) {
  if (length(t) != 1 && length(t) != n) {
    stop("length(t) is not 1 or n")
  }
  # invert the conditional reliability function: with U ~ uniform(0, 1),
  # x = b * (-log(U) + (t/b)^a)^(1/a) is distributed as X given X > t
  scale * (-log(runif(n)) + (t / scale)^shape)^(1 / shape)
}



You use this function just like rweibull, with the exception that you also pass in a vector t of minimum times (or a scalar representing the minimum time for all simulated values). The idea is that the probability that the random variable exceeds a value x, given that it exceeds the minimum t, is exp{-(x/b)^a} / exp{-(t/b)^a}, so you can simulate from this conditional distribution with a single uniform random variate. I use the inversion method on the reliability function (just like using the inversion method on the distribution function, with the insight that if U is uniform(0,1), then so is 1-U).
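As a quick sanity check on made-up parameters (shape 1.5, scale 24, and subjects known to be event-free at 12 months):

set.seed(3)
x <- my_rweibull(1e5, shape = 1.5, scale = 24, t = 12)
min(x)        # all simulated times exceed 12, as required
# simulated conditional survival past 18 months vs. the theoretical value
mean(x > 18)
exp(-(18/24)^1.5) / exp(-(12/24)^1.5)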

Truth be told, I ought to buckle down and learn how to do packages in R, but for now I'll just pass on some code on my blog if I think it will be helpful (or if I need to find it while doing a Google search later).

(Edit on 7/16: OOPS! A previous version of this had the scale and shape parameters switched. I've corrected it now. If you copied this before 7/16/2016, please check again.)

Thursday, January 14, 2016

Talk to Upstate Data Science Group on Caret

Last night I gave an introduction and demo of the caret R package to the Upstate Data Science group, meeting at Furman University. It was fairly well attended (around 20 people), and well received.

It was great to get out of my own comfort zone a bit (since graduate school, I've only really given talks on topics in biostatistics) and to meet statisticians, computer scientists, and other sorts of data scientists from many different fields. This is a relatively new group, and given the interest over the last couple of months or so, I think it has been sorely needed in the Upstate South Carolina region.

We'll be participating in Open Data Day in March of this year, so if you are in the Upstate SC region, or don't mind making the trek from Columbia or western NC, find us on Meetup. Our next meeting is a data hack night, which promises to be interesting.