Wednesday, August 29, 2012

Integrating R into a SAS shop

I work in an environment dominated by SAS, and I am looking to integrate R into our workflow.

Why would I want to do such a thing? First, I do not want to get rid of SAS. That would not only throw away most of our investment in SAS training and in hiring good SAS programmers, but it would also remove the advantages of SAS from our environment. These advantages include the following:

  • Many years of collective experience in pharmaceutical data management, analysis, and reporting
  • Workflow that is second to none (with the exception of reproducible research, where R excels)
  • Reporting tools based on ODS that are second to none
  • Much better validation tools than R’s (which makes IT folks happy), unless you buy a commercial R distribution
  • Automatic parallel processing for several common procedures

So, if SAS is so great, why do I want R?

  • SAS’s pricing model means that if I want a package that does everything I need, I pay thousands of dollars per year more than the basic package for a system that does far more than I will ever use. For example, to run a CART analysis I would have to buy Enterprise Miner, which goes well beyond that one method. (In R this is a few lines of code; see the sketch after this list.)
  • R is more agile and flexible than SAS
  • R more easily integrates with Fortran and C++ than SAS (I’ve tried the SAS integration with DLLs, and it’s doable, but hard)
  • R is better at custom algorithms than SAS, unless you delve into the world of IML (which is sometimes a good solution).
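
To make the CART point concrete, here is a minimal sketch in R using the rpart package (one of the recommended packages that ships with R) and its built-in kyphosis data set:

    # CART (classification tree) in R with rpart -- no extra license needed
    library(rpart)

    # Classify presence of kyphosis after surgery from age and
    # surgical details, using rpart's built-in example data
    fit <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis, method = "class")

    printcp(fit)          # cross-validated complexity parameter table
    plot(fit); text(fit)  # quick-and-dirty plot of the fitted tree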

I’m still evaluating ways to integrate the two, although the R interface in IML/IML Studio is promising.
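
For the curious, the IML route looks roughly like the following. This is a sketch, not production code: it requires SAS/IML 9.22 or later started with the RLANG system option, and the data set and model here are placeholders.

    proc iml;
      /* ship a SAS data set to R as a data frame named df */
      call ExportDataSetToR("sashelp.class", "df");
      submit / R;
        fit <- lm(Weight ~ Height, data = df)  # any R code can run here
        out <- data.frame(fitted = fitted(fit))
      endsubmit;
      /* pull the R results back into a SAS data set */
      call ImportDataSetFromR("work.out", "out");
    quit;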

Monday, August 27, 2012

Romney’s “secretive data mining”: could the same techniques be used for clinical trial enrollment?

Romney has been “exposed” as using “secretive data mining techniques” to find donors to his campaign in traditional Democratic strongholds. (The techniques themselves are hardly secret: they can be learned in free online courses through Coursera and Udacity. The real asset is the massive voter databases the different parties have collected.)

Of course, my thought is: can we use these techniques to find potential participants in clinical trials? If we can work out the privacy issues, I think this represents a useful tool for clinicians to find not just trial participants, but also patients who need treatment yet, for whatever reason, are not receiving it. This could be a win for everybody.

Other ideas:

  • using Google Trends, much as Google uses search data to identify flu outbreaks (a sketch follows this list)
  • mining discussion boards
  • identifying need through blog networks
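
As a rough sketch of the first idea: the contributed gtrendsR package can pull Google Trends series for disease-related search terms. The keyword and region below are illustrative placeholders, not a validated recruitment signal.

    # Pull relative search interest for a disease-related term
    library(gtrendsR)

    res <- gtrends(keyword = "rheumatoid arthritis", geo = "US")
    head(res$interest_over_time)  # weekly relative search volume
    plot(res)                     # look for regional or temporal spikes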

I’ll be taking the Web Intelligence and Big Data class through Coursera, so maybe I’ll get more ideas.

Monday, August 20, 2012

Clinical trials: enrollment targets vs. valid hypothesis testing

The questions raised in this Scientific American article ought to concern all of us, and I want to take some of these questions further. But let me first explain the problem.

Clinical trials and observational studies of drugs, biologics, and medical devices pose huge logistical challenges, not the least of which is finding physicians and patients to participate. The thesis of the article is that the classical methods of finding participants, mostly compensation, lead to perverse incentives to lie about one’s medical condition.

I think there is a more subtle issue, and it struck me when one of our clinical people expressed a desire not to put enrollment caps on large hospitals for the sake of fast enrollment. In our race to finish the trial and collect data, we are biasing our studies toward larger centers, where there may be better care. This effect is exactly the opposite of the one posited in the article, where the treatment effect is biased downward. Here, the treatment effect is biased upward: doctors at large centers are more familiar with best delivery practices (many of the drugs I study are IV or hospital-based) and best treatment practices, and deliver more efficient care.

We statisticians can start to characterize the problem by looking at the treatment effect across sites, or by using hierarchical models to separate the center effect from the drug effect (sketched below). But this isn’t always a great solution: low-enrolling sites, by definition, contribute far fewer subjects, and pooling them is problematic because low-enrolling centers tend to vary much more in level and quality of care than high-enrolling centers.
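
Here is what the hierarchical approach might look like in R with the lme4 package; the trial data frame and its columns (outcome, treatment, site) are hypothetical.

    library(lme4)

    # Random intercept per site: the site-level variance estimates how
    # much centers differ after accounting for treatment
    fit <- lmer(outcome ~ treatment + (1 | site), data = trial)
    summary(fit)

    # Letting the treatment effect vary by site probes whether
    # high-enrolling centers show a systematically different effect
    fit2 <- lmer(outcome ~ treatment + (1 + treatment | site), data = trial)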

We can get creative on the statistical analysis end of studies, but I think the best solution is going to involve stepping back at the clinical trial logistics planning stage and recasting the recruitment problem in terms of a generalizability/speed tradeoff.

Wednesday, August 15, 2012

Statisticians need marketing

I can't think of much I would disagree with in Simply Statistics' post "Statistics/statisticians need better marketing." I'll elaborate on a couple of points:

We should be more “big tent” about statistics. ASA President Robert Rodriguez nailed this in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era.
Apparently, the idea of data mining was rejected by most statisticians about 30 years ago, and it has found a home in computer science. Now data science is growing out of computer science, and analytics seems to be growing out of some hybrid of computer science and business. The embrace of the terms "data science" and "analytics" puzzled me for a long time, because these fields seemed to be just statistics, data curation, and understanding of the application. (I recognize now that there is more to them, especially the strong computer science component in big data applications.) I now see the tension between statisticians and practitioners of these related fields, and the puzzlement remains. Statisticians have a lot to contribute to these blooming fields, and we damn well better get to it.

We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be. 
Stanford has this down to a business plan. So do some other universities. This trail is being blazed, and we can just hop on it.

It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data.
Statistics Without Borders is a great place to do this. Perhaps SWB needs better marketing as well?

Monday, August 13, 2012

Observational data is valuable

I’ve heard way too many times that observational studies are flawed, and to really confirm a hypothesis you have to do randomized controlled trials. Indeed, this was an argument in the hormone replacement therapy (HRT) controversy (scroll down for the article). Now that I’ve worked with both observational and randomized data, here are a few observations:

  • The choice of observational vs. randomized is an important, but not the only, study design choice.

    Studies involve lots of design choices: follow-up length, measurement schedule, when during the disease course to observe, assumptions about risk groups, assumptions about the stability of risk over time (which was important in the HRT discussion about breast cancer), and the list goes on. A well-designed observational study can give far more valid information than a poorly designed randomized trial.
  • Only one aspect of a randomized trial is randomized (usually). Covariates and subgroups are not randomized.
  • Methods exist to make valid comparisons in an observational study. The data have to be handled much more carefully, and the assumptions behind the statistical methods have to be examined more closely, but powerful methods such as causal analysis (see the sketch after this list) or case-control designs can be used to draw strong conclusions.
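
To make the causal-analysis point concrete, here is a minimal inverse-probability-weighting sketch in base R. The obs data frame and its columns (treated, age, sex, severity, outcome) are hypothetical.

    # 1. Model the probability of treatment given measured confounders
    ps <- glm(treated ~ age + sex + severity,
              data = obs, family = binomial)$fitted.values

    # 2. Weight each subject by the inverse probability of the
    #    treatment actually received
    w <- ifelse(obs$treated == 1, 1 / ps, 1 / (1 - ps))

    # 3. Compare weighted outcomes; this is only valid if all important
    #    confounders are measured and the propensity model is adequate
    summary(lm(outcome ~ treated, data = obs, weights = w))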

Observational studies can complement or replace randomized designs. In fact, in controversies such as the use of thimerosal in vaccines, observational studies have been required to supply all the evidence (randomizing children to thimerosal and thimerosal-free groups to see whether they develop autism is not ethical). In post-marketing research and development for drugs, observational studies are used to further establish safety, determine the rate of rare serious adverse events, and determine the effects of real-world usage on the efficacy established through randomized trials.

Through careful planning, observational studies can generate new results, extend the results of randomized trials, or even set up new randomized trials.

Wednesday, August 8, 2012

The problem of multiple comparisons

John Cook's post "Wrong and unnecessary" at The Endeavour, and the comments to that post, are all worth reading (I rarely say that about the comments on a post). The post is ostensibly about whether a linear model is useful even though it is not, in the strict sense of the word, correct. In the comments, the idea of multiple comparisons comes up: not just whether adjustments are appropriate, but to what extent comparisons must be adjusted.

(For those not familiar with the problem, the Wikipedia article is worth reading. Essentially, multiple comparisons is the problem that you have a greater chance of falsely declaring statistical significance if you naively compare multiple endpoints, or compare the same endpoint over and over as you collect more data. Statisticians have many methods for adjusting for this effect, depending on the situation.)
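
A quick illustration in R of how fast the naive error rate grows, and what a standard adjustment does (the p-values are simulated under the null, so this is purely illustrative):

    # With 20 independent true-null tests at alpha = 0.05, the chance
    # of at least one false positive is about 64%:
    1 - (1 - 0.05)^20

    # Simulate 20 null p-values and adjust with base R's p.adjust
    set.seed(1)
    p <- runif(20)                                 # p-values under the null
    sum(p < 0.05)                                  # naive "discoveries"
    sum(p.adjust(p, method = "bonferroni") < 0.05) # after Bonferroni
    p.adjust(p, method = "BH")                     # Benjamini-Hochberg (FDR)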

In pivotal trials, usually only the primary endpoints have multiple comparison adjustments applied, and sometimes adjustments are applied separately to the secondary endpoints. (I find this practice bizarre, though I have heard it endorsed at least once by a reviewing team at the FDA.) Biostatisticians have complained that testing covariate imbalances at baseline (e.g., did people entering the treatment and placebo groups have the same age distribution?) adds to the multiple comparisons problem, even though these baseline tests do not directly lead to any declaration of significance. Bayesian methods do not necessarily "solve" the multiple comparisons problem, but rather account appropriately for multiple testing when looking at data multiple times or at multiple endpoints, if the methodology is set up appropriately.

Multiple comparison methods tend to break down when there are a large number of experiments or dimensions, such as the "A/B" experiments that web sites run, or testing for the existence of adverse events of a drug, where sponsors tend to want to see a large number of Fisher's exact tests. (In fact, the second situation suffers from the more fundamental problem that committing a Type I error, declaring the existence of an adverse event where there is none, is more conservative than committing a Type II error. Multiple comparison adjustments can lead to erroneous assurances of safety, while not adjusting can lead to a lot of additional research confirming or denying the significant Fisher's test results.)
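
For the adverse event situation, the computation itself is easy; the controversy is whether and how to adjust. A sketch with made-up counts (in practice the table comes from the safety database):

    # One Fisher's exact test per adverse event term, then a BH adjustment
    ae <- data.frame(
      term    = c("headache", "nausea", "rash"),
      trt_yes = c(10, 4, 7),    # subjects on treatment with the event
      trt_no  = c(90, 96, 93),  # subjects on treatment without it
      pbo_yes = c(5, 6, 1),     # subjects on placebo with the event
      pbo_no  = c(95, 94, 99)
    )

    p <- apply(ae[, -1], 1, function(x)
      fisher.test(matrix(x, nrow = 2, byrow = TRUE))$p.value)

    data.frame(term = ae$term, raw = p, bh = p.adjust(p, method = "BH"))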

I have even heard the bizarre assertion that a multiple comparison adjustment is required across the several tests on one biological process, but that a separate adjustment applies to the tests on another process, even within the same study.

I assert that while we have a lot of clever methods for multiple comparisons, we have bizarre and arbitrary rules for when to apply them. Multiple comparison methodologies (simple, complex, or modern) control the Type I error rate, or similar measures, over the tests to which they are applied, so the set of tests they cover needs to be justified scientifically, not by such arbitrary rules.

Monday, August 6, 2012

Getting connected: why you should get connected to people, and how

Getting connected to professionals in your field can be difficult, but it’s worth the effort. Here’s why, in no particular order:

  1. you exchange ideas, find new ways to approach problems, share career experiences, and learn to navigate the multitude of aspects of your profession
  2. you form connections that can potentially help if you need to change jobs
  3. it’s fun to be social (even if you are an introvert like me)
  4. you can potentially add a lot of value to your company, leading to career advancement opportunities
  5. you can justify going to cool conferences, if you enjoy those
  6. you have a better chance of new opportunities to publish, get invited talks, or collaborate

The above is fairly general, but for the how I will focus on statisticians because that is where I can offer the most:

  1. Join a professional organization. For statisticians in the US, join the American Statistical Association (ASA). Other countries have similar organizations: the UK has the Royal Statistical Society, and Canada, India, and China have similar societies. In addition, there are more specialized groups such as the Institute of Mathematical Statistics, the Society for Industrial and Applied Mathematics, the East North American Region of the International Biometric Society, the West North American Region of the International Biometric Society, and so forth. The ASA is very broad, while these other groups are more specialized. Chances are, there is a specialized group in your area.
  2. If you join the ASA, join a section, and find out if there is an active local chapter as well. The ASA is so huge that it is overwhelming to new members, but sections are smaller and more focused, and local chapters offer the opportunity to connect personally without a lot of travel or distance communication.
  3. You might start a group in your home town, such as an R User’s Group. Revolution Analytics will often sponsor a fledgling R User’s Group. Of course, this startup doesn’t have to be focused on R.
  4. If you have been a member for a couple of years, offer to volunteer. Chances are the work will not be glorious, but it will be important. Most important, you will gain skills coordinating others and will meet new people.
  5. If you go to a conference, offer to chair and try to speak. It is very easy to speak at the Joint Statistical Meetings.
  6. Use social media to make online connections, then try to meet these people in real life. I have formed several connections because I blog and tweet (@randomjohn). You can also use Google+, though I haven’t quite figured out how to do so effectively. I don’t use Facebook much as a professional outlet, but it is possible. Blogging offers a lot of other benefits as well, if you do it correctly, and blogging communities, such as R Bloggers and SAS Community, enhance its value.

Getting connected is valuable, and it takes a lot of work. Think of it as a career-long effort, and your network as a garden. It takes time to set up, maintain, and cultivate, but the effort is worth it.

Thursday, August 2, 2012

JSM 2012 in the rearview: reflections on the world's largest gathering of statisticians

The Joint Statistical Meetings (JSM) is an annual gathering of several large professional organizations of statisticians; each year we descend on some city to share ideas. I'm a perennial attendee, and I always find the conference valuable in several ways. A few thoughts about the conference in retrospect:

  • For me, networking is much more important than talks. Of course, attending talks in your area of interest is a great way of finding people with whom to network.
  • I'm very happy I volunteered with the biopharmaceutical section a couple of years ago. It's a lot of work, but rewarding.
  • This year, I deliberately went to a few sessions outside my area, and found the experience valuable.
  • I definitely recommend chairing a session or speaking.
  • I also recommend the roundtable lunches. I did one for the first time this year, and found the back-and-forth discussion valuable.

In short, I find connecting with like-minded professionals to be an important part of my career and of my development as a person.