Realizations in Biostatistics: 2007

Friday, December 7, 2007

Dynamic graphs in any web page via Google

Figure 1: what theoretical respondents drink right before producing pie charts

Google now has an API for generating dynamic graphs from a web site. All you need to do is put in the HTML! You can generate line charts, bar charts, pie charts, Venn diagrams, and scatterplots. Full instructions here.

(via Information Aesthetics)

Saturday, November 24, 2007

Quick lesson: how does SAS calculate confidence intervals?

Occasionally, I'll go through the referrals to this site. One referral I found was "how does SAS calculate confidence intervals." Here's a brief explanation.

1. What kind of confidence intervals are you talking about? Regression coefficients, sample means, sample variances, prediction intervals, differences of sample means, ratios?
2. Once you've got that figured out, go to the SAS manual. If you have SAS, it's in the help online. Otherwise, you can probably find it on the web. Go to the documentation for the right procedure (depends on what confidence interval you are calculating). In there, there is a section on details and computational considerations. In those sections SAS details the statistical theory they use for all of their procedures as well as any methods they use to make the procedure more efficient. If that isn't enough, they give copious references.

Back to work for me.

Wednesday, November 14, 2007

Bayes is big

When I was in graduate school, Bayesian statistics was a small, but important, part of my statistical inference curriculum. When I graduated, I all but forgot it. But a couple of years ago, I saw the storm on the horizon, and started furious self-study, including theory and computation. A few months ago, that storm hit shore. At the Joint Statistical Meetings 2007, I got raised eyebrows when I told former colleagues that life was turning me into a Bayesian, but I also met some prominent figures in the Bayesian biostatistics movement.

And now it's happening again.

About a year ago, I predicted over at Derek Lowe's excellent blog that a drug development program based on a full Bayesian approach would be 10 years off, though drug safety would probably see the largest immediate application. I was wrong. The storm is creeping inland. Be ready for it.

Wednesday, November 7, 2007

When graphics fail

I love graphical methods, and think that biostatisticians ought to use them more (in many sectors of my industry, they are using graphical methods more and are advancing the field). However, there are times when graphics deceive us even if they are done correctly, such as when one is trying to compare two overlaid time series.

Junk charts shows a recent example from the NYT. I think this example also shows that creating truly illuminating graphics is both an art and a science.

Wednesday, September 19, 2007

My O'Brien-Fleming design is not the same as your O'Brien-Fleming design

I know this discussion is a little technical, and nonstatisticians can probably skip this, but I hope that a statistician struggling with the O'Brien-Fleming design and its implementation in SAS/IML (notably the SEQ, SEQSHIFT, and SEQSCALE functions) can find this from a search engine and save hours of headache.

There are two ways of designing an O'Brien-Fleming design, a popular design for conducting interim analyses of clinical trials. The first method is to use an error (or alpha) spending function, which essentially gives you a "budget" of error you can spend at each interim analysis. The second is to realize that, if you are looking at cumulative sums in the trial, the O'Brien-Fleming design terminates if you cross a constant threshold. In the popular design programs LD98 and PASS 2007, the spending function approach is used. In the book Analysis of Clinical Trials using SAS (a book I highly recommend, by the way), the cumulative sum approach is used at the design stage (the spending function is used at the monitoring stage). When interim analyses are equally spaced, the two approaches give the same answer. When interim analyses are not equally spaced, the two approaches seem to give different answers. What's more, the spending function for O'Brien-Fleming as implemented in LD98 and PASS is different from what they show you in the books. They use:

4 - 4*PHI(z(1-alpha/4)/sqrt(tau))

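for two-sided designs, where PHI is the standard normal distribution function, z(p) is its p-th quantile, and tau is the information fraction.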
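To see what that formula does, here is a minimal Python sketch that evaluates it at a few information fractions. The function name and the choice of alpha = 0.05 are assumptions made for illustration; this is not code from LD98, PASS, or SAS/IML.

```python
from math import sqrt
from statistics import NormalDist


def obf_spending_two_sided(tau, alpha=0.05):
    """Cumulative two-sided alpha spent by information fraction tau.

    Implements 4 - 4*PHI(z(1 - alpha/4) / sqrt(tau)) as quoted above.
    The function name and default alpha are illustrative assumptions.
    """
    if not 0 < tau <= 1:
        raise ValueError("tau must be in (0, 1]")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 4)  # z(1 - alpha/4)
    return 4 - 4 * std_normal.cdf(z / sqrt(tau))


if __name__ == "__main__":
    # Unequally spaced looks, the case where the two design approaches diverge.
    for tau in (0.2, 0.5, 0.9, 1.0):
        spent = obf_spending_two_sided(tau)
        print(f"tau={tau:.1f}  cumulative alpha spent={spent:.6f}")
```

At tau = 1 the function returns the full alpha, and very little is spent at early looks, which is the characteristic O'Brien-Fleming behavior.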

They don't tell you these things in school. Or in the books.

Update: Steve Simon's post on the topic has moved as of 11/21/2008. Please see the third comment below.

Monday, September 17, 2007

He makes the data sing their story

Hans Rosling gave a TED talk in 2006. If you love to work with data, you must watch this.

By the way, I am a statistician, and I love it. Yeah, this is all observational and "hypothesis-generating," as we like to say, but letting the data sing their story tells us where to concentrate our efforts.

Saturday, September 1, 2007

Bias in group sequential designs - site effect and Cochran-Mantel-Haenszel odds ratio

It is well known that estimating treatment effects from a group sequential design results in a bias. When you use the Cochran-Mantel-Haenszel statistic to estimate an odds ratio, the number of patients within each site affects the bias in the estimate of the odds ratio. I've presented the results of a simulation study, where I created a hypothetical trial and then resampled from this trial 1000 times. I calculated the approximate bias in the log odds ratio (i.e. log of the CMH odds ratio estimate) and plotted that versus the estimated log odds ratio. The line is a cubic smoothing spline, made by the statement symbol i=sm75ps in SAS. The actual values are underprinted in light gray circles just to get some idea of the variability.
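The simulation code itself is not shown here, so the following is only a sketch of the estimator involved: the Mantel-Haenszel pooled odds ratio across sites, written in Python. The function name and the per-site counts are made up for illustration.

```python
from math import log


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over sites.

    Each stratum is (a, b, c, d):
      a = treated responders,  b = treated non-responders,
      c = control responders,  d = control non-responders.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty site contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ZeroDivisionError("odds ratio undefined: sum of b*c/n is zero")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for three sites of different sizes.
    sites = [(12, 8, 9, 11), (30, 20, 24, 26), (3, 2, 2, 3)]
    or_mh = mantel_haenszel_or(sites)
    print(f"MH odds ratio = {or_mh:.3f}, log OR = {log(or_mh):.3f}")
```

Because each site's contribution is weighted by its size, small sites contribute little to either sum, which is one way the number of patients per site can influence the estimate.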

Sunday, August 19, 2007

Quick hits - the pharmacogenomics of Warfarin, and the statistical analysis of TGN1412

1. Terra Sigillata explains the pharmacogenomics of warfarin far better than I can.

2. Andrew Gelman discusses the statistical analysis of TGN1412, a clinical trial that resulted in a cytokine storm (which led to multiple organ failure, cancer, gangrene, and amputation) for six very unfortunate volunteers and caused the fledgling biopharmaceutical company TeGenero to go bankrupt. I don't think anyone ever thought of doing a statistical analysis of the trial, simply because it was unnecessary. As it turns out, if you do a classical statistical analysis on the data, you get a result that isn't statistically significant! Basing a scientific conclusion on such an analysis is clearly absurd, and serves to highlight the limitations of statistics, or, rather, the way we think about and use statistics. The crux of the matter is that the adverse effects would have been significant if even one person had experienced a cytokine storm. Frequentist statistics can't pick up on that assumption, and Bayesian statistics would probably require an otherwise absurd prior to pick it up.

I guess there's a lot going on with statistics this week.

What does the First Ever Pharma Blogosphere Survey tell us

First, let me make a few comments. I find John Mack's Pharma Marketing blog useful. Marketing tends to be a black box for me. From my perspective, for the inputs you have guys who want to sell things, and for the outputs you have commercials and other promotional materials. I (partly by choice and partly by the way my brain works) understand very little about what happens between input and output. All I know is to play to my strengths and downplay my weaknesses. This is part of the reason I'm limiting myself to discussing statistical issues, at least on this blog.

However, when he came out with his First Ever Pharma Blogosphere Survey®©™, I was skeptical. In fact, I didn't pay much attention to it. But then he started making claims based on the survey, especially surrounding Peter Rost's new gig at BrandweekNRX. He predicted Brandweek would "flush its credibility down the toilet" by hiring Rost, and cited his survey data to back up his case (he had other arguments as well, but, as noted above, I'm just covering what I know). And since I'm skeptical of his data, I'm skeptical of his analysis, and, therefore, his arguments, conclusions, and predictions based on the data. To his credit, however, he posts the raw data so at least we know he didn't use a graphics program to draw his bar graphs.

Rost's counterarguments are worthy of analysis as well. He notes that most people read the Pharma Marketing blog (the survey was conducted from its sister site Pharma Blogosphere), raising the question about which population Mack was really sampling. The correct answer, of course, is people who happened to read that blog entry around the day it was posted who cared enough to bother to take a web survey. I would agree that Mack's following probably makes up the bulk of the survey.

But more important is the comparison of the survey to more objective data, such as site counters. (Note that site counter data isn't perfect, either, but it is more objective than web polls since the data collection does not require user interaction.) And it looks like that objective data doesn't match Mack's data.

Then you throw in the data from eDrugSearch, which has its own algorithm for ranking healthcare websites, but its rankings seem very out of line with those of Technorati, which uses some modifications to the number of incoming links (I think to adjust for the fact that some blogs just all link to one another).

So, at any rate, you can be sure that Peter Rost will keep you abreast of his rankings, and for now they certainly do not seem to match Mack's predictions. And, while the eDrugSearch and Technorati rankings seem far from perfect, they do tend to agree on the upward trend in readership of BrandweekNRx and Rost's personal blog, at least for now. Mack's survey, and the predictions based on it, are the only data I've seen so far that have not agreed.

In the meantime, I say the proof is in the pudding. Read these sites, or, better yet, put them in an RSS reader so you can skim for the material you like. Discard the material you don't like. As for me, well, I like to keep abreast of the news in my industry because, well, it could affect my ability to feed my children. So far, Rost's blog breaks news that doesn't get picked up anywhere else (as do Pharmalot and PharmGossip). Mack's blogs did, too, at least until he started getting obsessed with his subjective evaluation of Rost's content.

Web polls in blog entries - I don't trust them

I distrust web polls. While there are more trustworthy sources of polling such as Surveymonkey, these web surveying sites have to be backed up with essentially the same type of operational techniques found in standard paper surveys. The web polls I distrust are the ones that bloggers put in their blog entries to poll their readers on their thoughts about certain issues. Sometimes they will even follow up with an entry saying "this isn't a scientific poll, but here are the results."

A small step up from this are the web surveys, such as John Mack's First Ever Pharma Blogosphere Survey®™©. They have a lot of the same problems as the simple web poll, and few of the controls necessary to ensure valid results. So I'll discuss simple one-off web polls and web surveys together.

Most of the problems and biases with these web polls aren't statistical; rather, they are operational. The data from these is so bad that no amount of statistics can rescue them. It's better not to even bring statistics into the equation here. Following are the operational biases I consider unavoidable and insurmountable:

Most web polls do not control whether one person can vote multiple times. Most services will now use cookies or IP addresses to block multiple votes from one person, but these safeguards are imperfect at best. Changing an IP address is easy (just go to a different Starbucks), and cookies are easily deleted.
Wording questions in surveys is a tricky proposition, and millions of billable hours are spent agonizing over the wording. (Perhaps 75% of that is going a bit too far, but you get the point.) Very little time is generally spent wording the question of a web poll. The end result is that readers may not be answering the same question a blogger asks.
Forget random sampling, matching cases, identifying demographic information, or any of the classical statistical controls that are intended to isolate noise and false signal from true signal. Web poll samples are "People who happen to find the blog entry and care enough to click on a web poll." At best, the readers who feel strongly about an issue are the ones likely to click, while people who feel less strongly (but might lean a certain way) will probably just glaze over.
Answers to web polls will typically be immediate reactions to the blog post, rather than thoughtful, considered answers. Internet life is fast-paced, and readers (in general) simply don't have the time to thoughtfully answer a web poll.

Web polls and surveys might be useful for gauging whether readers are interested in a particular topic posted by the blogger, and so they do have a use in guiding the future material in a blog. But beyond that, I can't trust them.

Next step: an analysis of the John Mack/Peter Rost kerfuffle.

Shout outs

Realizations is a tiny blog, getting just a tiny bit of traffic. After all, I cover a rather narrow topic. Every once in a while, someone finds an article on here worth reading, and, even less often, they link to it.

So, shout outs (some long overdue) to the people who have linked:

Kevin, MD (On flaws in the Avandia meta-analysis and potential regulatory fallout)
Peter Rost (On the statistical reporting system and how it could impact Novartis)
Kitchen Table Math (A general link to the site)
Captador (DCA)
Highlight Health (DCA)

Friday, August 17, 2007

Pharmacogenomics: "Fascinating science" and more

The FDA has issued two press releases in two days where pharmacogenomics has played the starring role: one on warfarin and one on codeine use by nursing mothers.

Pharmacogenomics is the study of the interaction of genes and drugs. Most of the study, and certainly the most mature part of the field, has been on the study of drug metabolism, especially the cytochrome P450 enzymes, which are found in most kinds of life. The FDA's press releases are based on this science.

Pharmalot has reported on the mixed reactions to the warfarin news. One reaction was that "It's fascinating science, but not ready for prime time." Maybe not in general for all drugs, but the pharmacogenomics of warfarin has been studied for some time, and a greater understanding of the metabolism of this particular drug is critical to its correct application. Warfarin is a very effective drug, but it has two major problems:
1. The difference between the effective dose and a toxic dose is small enough to require close monitoring (i.e. it has a narrow therapeutic window)
2. It is implicated in a large number of adverse effects during ER visits (probably mostly for stroke or blood clots)

The codeine use update is even more urgent. Codeine is activated by the CYP2D6 enzyme, which has a wide variation in our population (gory detail at the Wikipedia link). In other words, the effects of codeine on people vary widely. The morphine that results from codeine metabolism is excreted in breast milk. If a nursing mother is one of the uncommon people who have an overabundance of CYP2D6, a lot of morphine can get excreted into breast milk and find its way into the baby. The results can be devastating. Fortunately, CYP2D6 tests have been approved by the FDA, and the price will probably start falling. Whether this science is ready for prime time or not (and CYP2D6 is probably the most studied of all the metabolism enzymes, so it probably is), it's fairly urgent to start applying this right away.

I applaud the FDA for taking these two steps toward applying pharmacogenomics to important problems. There may be issues down the road, but it's high time we started applying this important science.

Thursday, August 16, 2007

Good Clinical Trial Simulation Practices

I didn't realize they had gotten this far, and they did so 8 years ago! A group has put together a collection of good clinical trial simulation practices. While I only partly agree with the parsimony principle, I think the guiding principles are in general sound. I'd like to see this effort get wider recognition in the biostatistical community so that clinical trial simulations will get wider use. That can only help bring down drug development costs and promote deeper understanding of the compounds we are testing.

IRB abuses

Institutional Review Boards are a very important part of our human research. They are the major line of defense against research that degrades our humanity, and they protect subjects in clinical research. Thank goodness they're there to avoid a repeat of a nasty part of our history.

Unfortunately, as institutions do, IRBs have suffered from mission creep and a growing conservatism. It's a growing opinion that IRBs are overstepping their bounds and bogging down research and journalism that has no chance of harming human subjects. Via Statistical Modeling etc. I found IRBWatch, which details some examples of IRB abuses.

Tuesday, August 14, 2007

Review of Jim Albert's Bayesian Computation with R

When I first read Andrew Gelman's quick off-the-cuff review of the book Bayesian Computation with R, I thought it was a bit harsh. So did Gelman.

I thumbed through the book at the Joint Statistical Meetings, and decided to buy it along with Bayesian Core. And I'm glad I did. Albert clearly positioned the book to be a companion to an introductory and perhaps even intermediate course in Bayesian statistics. I've found the book to be very useful for learning about Bayesian computation and deepening my understanding of Bayesian statistics.

The Bad

I include the bad first because there are few bad things.

I thought the functions laplace (which computes the normal approximation to a posterior using the Laplacian method) and the linear regression functions were a bit black-boxish. The text described these functions generally, but not nearly in the detail that it described other important functions such as rwmetrop and indepmetrop (which run random walk and independence Metropolis chains). Since I think that laplace is a very useful function, I think it would have been better to go into a little more detail. However, Albert did show the function in action in many different situations, including the computation of Bayes factors.
The choice of starting points for laplace seemed black-boxish as well. They were clearly chosen to be close to the mode (one of the function's jobs is to compute a mode of the log posterior distribution), but Albert doesn't really go into how to choose "intelligent" starting points. I recommend a grid search using the R function expand.grid (and patience).
I wish the chapter on MCMC included a problem on Gibbs sampling, though there is a chapter on Gibbs sampling at the end.
I wish it included a little more detail about accounting for the Jacobian when parameters are transformed. (Most parameters are transformed to the real line.)
I wish the book included more about adaptive rejection sampling.
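The grid search suggested above for picking laplace starting points can be sketched as follows. This is a Python analogue of crossing grids with expand.grid in R, not code from the book or from LearnBayes; the toy log posterior, the grid ranges, and all function names are assumptions for illustration.

```python
from itertools import product


def linspace(low, high, count):
    """Evenly spaced grid of `count` points from low to high inclusive."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


def grid_search_start(log_post, grids):
    """Evaluate log_post on the full cross of the grids (as expand.grid
    would build it) and return the best point and its log posterior."""
    best_point, best_value = None, float("-inf")
    for point in product(*grids):
        value = log_post(point)
        if value > best_value:
            best_point, best_value = point, value
    return best_point, best_value


def toy_log_post(theta):
    """Hypothetical unnormalized log posterior with its mode at (1, -2)."""
    x, y = theta
    return -0.5 * ((x - 1.0) ** 2 + (y + 2.0) ** 2)


if __name__ == "__main__":
    grids = [linspace(-5.0, 5.0, 21), linspace(-5.0, 5.0, 21)]
    start, value = grid_search_start(toy_log_post, grids)
    print("starting point for the mode finder:", start, "log posterior:", value)
```

The number of evaluations grows as the product of the grid sizes, hence the patience: a coarse grid followed by a finer one around the best point is usually enough to hand the mode finder a sensible start.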

The Good

In no particular order:

Albert includes detailed examples from a wide variety of fields. The examples vary in difficulty from run-of-the-mill (such as estimating a single proportion) to the sophisticated (such as Weibull survival regression with censored data). Regression and generalized linear models are covered.
The exercises really deepen the understanding of the material. You really need a computer with the R statistical package to read this book and get the most out of it. Take the time to work through the examples. Because I did this, I much better understand the Metropolis algorithms and the importance of choosing the right algorithm (and right parameters) to run an MCMC. Do it incorrectly and the results are compromised due to high (sometimes very high) autocorrelation and poor mixing.
The book is accompanied by a package LearnBayes that contains a lot of good datasets and some very useful functions for learning and general use. The laplace, metropolis, and gibbs (which actually implements Metropolis within Gibbs sampling) functions all can be used outside of the context of the book.
The book covers several different sampling algorithms, including importance sampling, rejection sampling (not adaptive), and sample importance resampling. Along with this material are examples and exercises that show the importance of good proposal densities and what can happen with bad proposal densities.
A lot of the exercises extend exercises in previous chapters, so that the active reader gets to compare different approaches to the same problem.
The book heavily refers to other books on Bayesian statistics, such as Berry and Stangl's Bayesian Biostatistics, Carlin and Louis's Bayes and Empirical Bayes for Data Analysis, and Gelman, et al's Bayesian Data Analysis. In doing so, this book increases the instructive value of the other Bayesian books on the market.
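To illustrate the point above about choosing the right parameters for a Metropolis chain, here is a minimal random walk Metropolis sketch in Python. It is not the LearnBayes rwmetrop function; the standard normal target, the proposal scales, and the function names are assumptions chosen to show how a badly scaled proposal produces high autocorrelation or poor mixing.

```python
import random
from math import exp


def rw_metropolis(log_post, start, scale, n_iter, seed=0):
    """One-dimensional random walk Metropolis with a normal proposal.

    Returns (draws, acceptance_rate)."""
    rng = random.Random(seed)
    current, current_lp = start, log_post(start)
    draws, accepted = [], 0
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, scale)
        proposal_lp = log_post(proposal)
        diff = proposal_lp - current_lp
        # Accept with probability min(1, exp(diff)).
        if diff >= 0 or rng.random() < exp(diff):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        draws.append(current)
    return draws, accepted / n_iter


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return float("nan")  # chain never moved
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def std_normal_log_density(x):
    """Unnormalized log density of a standard normal target."""
    return -0.5 * x * x


if __name__ == "__main__":
    for scale in (0.1, 2.5, 50.0):
        draws, rate = rw_metropolis(std_normal_log_density, 0.0, scale, 5000, seed=1)
        print(f"scale={scale:5.1f}  acceptance={rate:.2f}  "
              f"lag-1 autocorrelation={lag1_autocorr(draws):.2f}")
```

A tiny scale accepts almost everything but crawls, and a huge scale rejects almost everything and sits still; both show up as strong autocorrelation in the draws.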

Overall, this book is a great companion to any effort to learn about Bayesian statistics (estimation and inference) and Bayesian computation. Like any book, its rewards are commensurate with the effort. I highly recommend working the exercises and going beyond the scope of the exercises (such as investigating diagnostics when not explicitly directed to do so). Read/work this book in conjunction with other heavy-hitter books such as Bayes and Empirical Bayes or Bayesian Data Analysis.

Wednesday, August 1, 2007

A good joint statistical meetings week

While I was not particularly enthralled with the location, I found this year's Joint Statistical Meetings to be very good. By about 5 pm yesterday, I thought it was going to be so-so. There were good presentations on adaptive trials and Bayesian clinical trials, and even a few possible answers to some serious concerns I have about noninferiority trials. Last night I went to the biopharmaceutical section business meeting, and struck up conversations with a few people from the industry and the FDA (including the speaker who had some ideas on how to improve noninferiority trials). And shy, bashful me who had to drink 3 glasses of wine a couple of years ago to get up the courage to approach a (granted rather famous) colleague was one of the last ones to leave the mixer.

This morning, I was still feeling a little burned out, but decided to drag myself to a session on Bayesian trials in medical devices. I found the speakers (who came from both industry and FDA) top notch, and at the end the session turned into a very nice dialog on the CDRH draft guidance.

I then went to a session on interacting with the FDA in a medical device setting, and again speakers from both the FDA and industry were top notch. Again, the talks turned into very good discussions about how to most effectively communicate with the FDA, especially from a statistician/statistical consultant's point of view. I asked the question of how to handle the situation where, though it's not in the sponsor's best interest, a sponsor wants to kick the statistical consultants out of the FDA interactions. The answer: speak the sponsor's language, which is in dollars. Quite frankly, statistics is a major part of any clinical development plan, and unless the focus is specifically on chemistry, manufacturing, and controls (CMC), a statistician needs to be present for any contact with the FDA. (In a few years, it might be true for CMC as well.) If this is not the case, especially if it's consistently not the case throughout the development cycle of the product, the review can be delayed, and time is money. Other great questions were asked on use of software and submission of data. We all got an idea of what is required statistically in a medical device submission.

After lunch was a session given by the section on graphics and International Biometric Society (West N America Region). Why it wasn't cosponsored by biopharmaceutical, I'll never know. The talks were all about using graphs to understand effects of drugs, and how to use graphs to effectively support a marketing application or medical publication. The underlying message was to get out of the 60's line printer era with the illegible statistical tables, and take advantage of the new tools available. Legibility is key in producing a graph, followed by the ability to present a large amount of data in a small area. In some cases, many dimensions can be included on a graph, so that the human eye can spot potential complex relationships among variables. Some companies, notably big pharma, are far ahead in this arena. (I guess they have well-paid talent to work on this kind of stuff.)

These were three excellent sessions, and worth demanding more of my aching feet. Now I'm physically tired and ready to chill with my family for the rest of the week/weekend before doing "normal" work on Monday. But professionally, I'm refreshed.

Tuesday, July 31, 2007

Biostatistics makes the news, and hope for advances

Black hole or black art, biostatistics (and its mother field statistics) is a topic people tend to avoid. I find this unfortunate, because it makes discussions of drug development and surveillance news rather difficult.

Yet these discussions affect millions of lives, from Avandia to echinacea to zinc. So I get tickled pink when a good discussion of statistics comes out in the popular press. They even quoted two biostatisticians who know what they are talking about, Susan Ellenberg and Frank Harrell, Jr. Thanks, BusinessWeek! Talking about the pros and cons of meta-analysis is a difficult task, and that's if you're aiming for an audience of statisticians. To tackle the topic in a popular news magazine is courageous, and I hope establishes a trend.

On the other hand, I have a few friends who cannot pick up a copy of USA Today without casting a critical eye. Turns out, they had a professor who constantly was mining the paper for examples of bad statistical graphics. (I have nothing against USA Today. In fact, I've appreciated their treatment of transfats.)

In other news, two new books on missing data have been released this year. Little and Rubin have released the second edition of their useful 1987 book. Molenberghs and Kenward have come out with their book that's designed specifically for missing data in clinical studies. I ended up picking up the latter for its focus, and I attended a workshop earlier this year by Geert Molenberghs that was pretty good. I'm very glad these books have been released because they're sorely needed. And at the Joint Statistical Meetings this year, there was a very good session on missing data (including a very good presentation by a colleague). I hope this means in the future we can think more intelligently about how to handle missing data because, well, in clinical trials you can count on patients dropping out.

Friday, July 20, 2007

Whistleblower on "statistical reporting system"

Whether you love or hate Peter Rost (and there seems to be very little in between), you can't work in the drug or CRO industry and ignore him. Yesterday, he and Ed Silverman (Pharmalot) broke a story on a director of statistics who blew the whistle on Novartis. Of course, this caught my eye.

While I can't really determine whether Novartis is "at fault" from these two stories (and related echoes throughout the pharma blogs), I can tell you about statistical reporting systems, and why I think that these allegations can impact Novartis's bottom line in a major way.

Gone are the days of doing statistics with pencil, paper, and a desk calculator. These days, and especially in commercial work, statistics are all done with a computer. Furthermore, no statistical calculation is done in a vacuum. Especially in a clinical trial, there are thousands of these calculations which must be integrated and presented so that they can be interpreted by a team of scientists and doctors who then decide whether a drug is safe and effective (or, more accurately, whether a drug's benefits outweigh its risks).

A statistical reporting system, briefly, is a collection of standards, procedures, practices, and computer programs (usually SAS macros, but may involve programs in any language) that standardize the computation and reporting of statistics. Assuming they are well-written, these processes and programs are general enough to process the data from any kind of study and produce reports that are consistent across all studies, and, hopefully, across all product lines in a company. For example, there may be one program to turn raw data into summary statistics (n, mean, median, standard deviation) and present them in a standardized way in a text table. Since this is a procedure we do many times, we'd like to just be able to "do it" without having to fuss over the details. We feed the variable name in (and perhaps some other details like number of decimal places) and voilà, the table. Not all statistics is that routine (and good for me because that means job security), but perhaps 70-80% is and can be made more efficient. Other programs and standards will take care of titles, footnotes, column headers, formatting, tracking, and validation in a standardized and efficient way. This saves a lot of time in both programming and in review and validation of tables.
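As a toy illustration of the kind of routine described above (feed in a variable and a number of decimal places, get back a standardized table row), here is a small Python sketch. Real systems are usually SAS macros and far more elaborate; the function names, column widths, and data below are all made up.

```python
from statistics import mean, median, stdev


def summary_row(label, values, decimals=1):
    """Format one text-table row: label, n, mean, median, SD.

    Missing values (None) are dropped before summarizing."""
    clean = [v for v in values if v is not None]
    n = len(clean)
    if n == 0:
        return f"{label:<12}{n:>5}"
    sd = stdev(clean) if n > 1 else float("nan")  # SD needs at least 2 values
    return (f"{label:<12}{n:>5}"
            f"{mean(clean):>10.{decimals}f}"
            f"{median(clean):>10.{decimals}f}"
            f"{sd:>10.{decimals}f}")


def summary_table(variables, decimals=1):
    """Build the whole table, header included, from {label: values}."""
    header = f"{'Variable':<12}{'n':>5}{'Mean':>10}{'Median':>10}{'SD':>10}"
    rows = [summary_row(label, values, decimals)
            for label, values in variables.items()]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    # Hypothetical raw data with one missing weight.
    data = {
        "AGE": [34, 45, 29, 61, 50],
        "WEIGHT": [70.2, 81.5, None, 64.8, 92.1],
    }
    print(summary_table(data, decimals=1))
```

The point of centralizing even something this small is the one made in the text: every table in every study gets the same layout, and a bug fixed once is fixed everywhere (and a bug introduced once is introduced everywhere).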

So far, so good. But what happens when these systems break? As you might expect, you have to pay careful attention to these statistical reporting systems, even go so far as applying some software development life cycle methodology. If they break, you influence not just one calculation but perhaps thousands. And there is no way of knowing - obscure bugs in the code might influence just 10 out of a whole series of studies, where a more serious bug might affect everything. If this system is applied to every product in house (and it should probably be general enough to apply to at least one category of products, such as all cancer products), the integrity of the data analysis for a whole series of products is compromised.

Allegations were also made that a contract programmer was told to change dates on adverse events, which could either be a benign but bizarre request if the reasons for the change are well-documented (it's better to change dates in the database than at the program level, because it's easier to audit changes to a database and specific changes to specific dates keep a program from being generalizable to other similar circumstances) or an ethical nightmare if the changes were done to make the safety profile of the drug look better. From Pharmalot's report, the latter was alleged.

You might guess the consequences of systematic errors in data submitted to the FDA. The FDA does have the authority to kick out an application if it has good reason to believe that its data is incorrect. This application has to go through the resubmission process, after it is completely redone. (The FDA will only do this if there are systematic problems.) This erodes the confidence the reviewers have in the application, and probably even all applications submitted by a sponsor who made the errors. This kind of distrust is very costly, resulting in longer review periods, more work to assure the validity of the data, analysis, and interpretation, and, ultimately, lower profits. Much lower.

It doesn't look like the FDA has invoked its Application Integrity Policy on Novartis's Tasigna or any other product. But it has invoked its right to three more months of review time, saying it needs to "review additional data."

So, yes, this is big trouble as of now. Depending on the investigation, it could get bigger. A lot bigger.

Update: Pharmalot has posted a response from Novartis. In it, Novartis reiterates their confidence in the integrity of their data and claims to have proactively shared all data with the FDA (as they should). They also claim that the extension to the review time for the NDA was for the FDA to consider amendments to the submission.

This is a story to watch (and without judgment, for now, since this is currently a matter of "he said, she said"). And, BTW, I think Novartis responded very quickly. (Ed seems to think that 24 hours was too long.)

Thursday, June 21, 2007

Surreality in noninferiority trials: Advanced Life "statistically not inferior"

Noninferiority trials are trials that aim to show that one drug (or, more generically, factor) is not worse than another, older active drug. This is accomplished by setting a "noninferiority margin," running the trial, taking the difference between the two treatment effects, and finding out where the lower end of the 95% confidence interval lies. If the lower 95% confidence limit is higher than the noninferiority margin, then the new drug is noninferior to the other one.
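The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outcome is a cure proportion (higher is better), the margin is written as a negative difference, the interval is a simple normal-approximation (Wald) interval, and the trial numbers are hypothetical rather than taken from any real study.

```python
from math import sqrt

def noninferiority_check(p_new, n_new, p_ctrl, n_ctrl, margin, z=1.96):
    """Return (difference, lower limit, upper limit, noninferior?).

    p_new, p_ctrl: observed success proportions (higher is better).
    margin: noninferiority margin as a negative difference, e.g. -0.10.
    """
    diff = p_new - p_ctrl
    # Wald standard error of a difference of two independent proportions
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctrl * (1 - p_ctrl) / n_ctrl)
    lower = diff - z * se
    upper = diff + z * se
    # Noninferior when the whole interval sits above the margin
    return diff, lower, upper, lower > margin

# Hypothetical trial: new drug cures 83%, control cures 85%, 400 patients per arm.
diff, lower, upper, ok = noninferiority_check(0.83, 400, 0.85, 400, margin=-0.10)
print(f"difference = {diff:+.3f}, 95% CI = ({lower:+.3f}, {upper:+.3f}), noninferior: {ok}")
```

With these made-up numbers the new drug performs slightly worse than the control and is still declared noninferior, which is exactly the kind of result that reads strangely in a press release.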

These trials are useful (in pharmaceuticals) when placebo-controlled trials are considered unethical or otherwise infeasible and there are other treatments on the market; they are used all the time in anti-infectives (especially antibiotics). And, of course, they have their problems, like how to choose noninferiority margins (and I hope to shed a little darkness on this issue at an upcoming talk at the Joint Statistical Meetings). Don't look to the FDA for guidance; they've retracted all guidances and say that all margins must be justified statistically and clinically on a case-by-case basis.

And then there is this problem. The short of it goes like this: Advanced Life's new antibiotic treatment showed an effect that was slightly smaller than the active control, yet the lower 95% confidence limit still fell above the noninferiority margin. (This is the meaning of "statistically not inferior to ... Biaxin, although the latter drug technically performed better.") Of course, the consequences of this seemed pretty bad, and the press release language is certainly bizarre.

Noninferiority trials, in my opinion, are one of the failings of statisticians. We simply haven't figured out, at least in the classical statistics camp, how to effectively do this kind of analysis. The way we have it set up right now, we can have a drug that is slightly inferior slip under the radar, and if we approve a slightly inferior drug, and then use that as an active control against another drug that is slightly inferior to that, and so forth, you can end up with a truly worthless drug coming through with shining colors.

I haven't explored the Bayesian version of noninferiority, but it seems better to evaluate a drug on the basis of a posterior probability that the new drug is at least as good as the control than it does to set an arbitrary margin and see where the confidence interval of the difference falls. Unless we can come up with a better solution based on classical statistics.

Thursday, June 14, 2007

Look at the size of that trial! Avandia fallout? Let's hope not.

Usually, when we determine the sample size of a clinical trial, we calculate based on efficacy first. The International Conference on Harmonisation (ICH) has drug exposure guidelines which define the minimum exposure needed to assess safety. And if our efficacy sample sizes aren't quite enough to examine the safety issues outlined in the guidelines, we bump the sample size up.

Via Kevin, MD, I found someone who is sharing my concern that, in the recent swing of focus to drug safety, we are probably passing up drugs that may have ugly side effects, but are necessary in treating ugly diseases. Of course, no one wants to risk liver damage to quell a toothache (or do we? Hey, do you take acetaminophen?).

It goes like this: Steve Nissen's meta-analysis took over 42 trials and 28,000 patients to detect a statistically significant result in cardiovascular risk for Avandia. If we want to be absolutely sure that a drug is safe and doesn't even have a small risk, this is the patient population size you will have to enroll in a development program.

I pretty much agree with the assessment that big pharma is to blame with dubious marketing practices and the blockbuster mentality. An FDA with the authority and will to enforce post-marketing commitments and safety surveillance will go a long way to identifying safety issues more quickly and identifying patients most at risk of adverse drug effects, much like the way that apolipoprotein E (ApoE) genotyping may identify the patients most at risk of cardiovascular side effects of Avandia.

That is, if we can keep our head on straight and remember the goal of drug research — to help people.

Saturday, May 26, 2007

On Avandia: The difference between statistically significant and not statistically significant ...

... is not statistically significant.

I'm placing less and less trust in p-values. Perhaps the Bayesians are getting to me. But hear me out.

In the Avandia saga, GSK got their drug approved on two adequate and well-controlled trials (and have studied their drug even more!). There was some concern over cardiovascular risks (including heart attack), but apparently the risk did not outweigh the benefits. Steve Nissen performs a meta-analysis on GSK's data from 42 (!) randomized controlled trials, and now the lawyers are lining up, and the FDA's favorite congressmen are keeping the fax lines busy with request letters and investigations.

Here's how the statistics are shaking out: the results from the meta-analysis show a 43% increase in relative risk of myocardial infarction, with a p-value of 0.03. The (unspecified) increase in deaths didn't reach statistical significance, with a p-value of 0.06.

Argh. Seriously, argh. Does this mean that the relative risk of myocardial infarction is "real" but the increase in deaths is "not real"? Does the 43% increase in relative risk even mean anything? (C'mon people, show the absolute risk increase as well!)

According to the Mayo Clinic, the risk is 1/1000 (Avandia) vs. 1/1300 (other medications) in the diabetic study populations. That works out to a 30% increase in relative risk, not the same as what MedLineToday reported. The FDA's safety alert isn't very informative, either.
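To see how different the relative and absolute pictures are, here is a minimal sketch using only the two risks quoted above. The function name is made up for illustration, and the "number needed to harm" line is an extra back-of-the-envelope figure that ignores follow-up time and uncertainty, so treat it as a rough guide only.

```python
def risk_summary(risk_treated, risk_comparator):
    """Relative risk, absolute risk increase, and the reciprocal of the increase."""
    relative_risk = risk_treated / risk_comparator
    absolute_increase = risk_treated - risk_comparator
    number_needed_to_harm = 1 / absolute_increase
    return relative_risk, absolute_increase, number_needed_to_harm

# Risks as quoted: 1 in 1000 on Avandia vs. 1 in 1300 on other medications
rr, ari, nnh = risk_summary(1 / 1000, 1 / 1300)
print(f"relative risk: {rr:.2f} ({(rr - 1) * 100:.0f}% increase)")
print(f"absolute risk increase: {ari * 100:.3f} percentage points")
print(f"number needed to harm: about {nnh:.0f}")
```

A 30% relative increase sounds alarming; an absolute increase of a few hundredths of a percentage point tells a rather different story, which is why both numbers belong in a report.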

Fortunately, the NEJM article is public, so you can get your fill of statistics there. So, let me reference Table 4. My question: was the cardiovascular risk real in all studies combined (p=0.03), but not in DREAM (p=0.22), ADOPT (p=0.27), or all small trials combined (p=0.15)? That seems to be a pretty bizarre statement to make, and is probably why the European agencies, the FDA, and Prof. John Buse of UNC-Chapel Hill (who warned the FDA of cardiovascular risks in 2000) have urged patients not to switch right away.

The fact of the matter is if you look for something hard enough, you will find it. It apparently took 42 clinical trials, 2 of them very large, to find a significant p-value. Results from such a meta-analysis on the benefits of a drug probably wouldn't be taken as seriously.

Let me say this: the cardiovascular risks may be real. Steve Nissen's and John Buse's words on the matter are not to be taken lightly. But I think we need to slow down and not get too excited over a p-value that's less than 0.05. This needs a little more thought, not just because I'm questioning whether the statistical significance of the MI analysis means anything, but also because I'm questioning whether the non-significance of the mortality analysis means the death rates aren't different.

Update: Let me add one more thing to this post. The FDA realizes that p-values don't tell the whole story. They have statistical reviewers, medical reviewers, pharmacokinetic reviewers, and so forth. They look at the whole package, including the p-values, medical mechanism of action, how the drug moves through the body, and anything else that might affect how the drug changes the body. Likewise, Nissen and company discuss the medical aspects of this drug, and don't let the p-values tell the whole story. This class of compounds -- the -glitazones (also known as PPAR agonists) -- is particularly troublesome for reasons described in the NEJM article. So, again, don't get too excited about p-values.

Tuesday, May 1, 2007

A plunge into the dark side

I'm referring, of course, to Bayesian statistics. My statistical education is grounded firmly in frequentist inference, though we did cover some Bayesian topics in the advanced doctorate classes. I even gave one talk on empirical Bayes. However, in the last 8 or so years, all that knowledge (such as it was) was covered over.

No more. I've had as a goal to get my feet wet again, because I knew some time or another I would have to deal with it. Well, that some time or another is now, and it probably won't be another eight years after this time before I have to do it again. So off to a short course I go, and equipped with books by Gelman, et al. and Gamerman, I'll be a fully functional Bayesian imposter in no time. I'm looking forward to it.

Tuesday, April 17, 2007

A tale of two endpoints

Some time ago, when Gardasil was still in clinical trials, I congratulated the Merck team for a product with 100% efficacy. After all, getting anything with 100% efficacy is a rare event, especially in drug/biologic development.

Apparently, that congratulations was a little too soon. Looks like Merck may have found a surrogate endpoint that their vaccine managed very well, but if you look at the important endpoint, the story doesn't look quite so rosy.

So, to be specific, Gardasil is marketed to protect against two strains of human papillomavirus (HPV) that account for 70% of cervical cancer cases. (Types 16 and 18, for those keeping track.) Merck is going for 80% now by asking the FDA to add types 6 and 11 to the label.

Ed from Pharmalot notes that in clinical trials, among women that already have HPV, the vaccine reduces precancerous lesions (no time limit given) by 14%. For women that don't have HPV, the occurrence of precancerous lesions is reduced 46%. Presumably this is because the vaccine is ineffective against strains that already infect the body. Merck's spin engine is carefully worded to tout that 70%, even though that number is only of secondary importance. It's the 14% and 46% that really matter.

Addendum: I looked at the Gardasil PI, and they already mention 6 and 11. They also mention all other sorts of efficacy measures. The patient product information is less informative. My guess is Merck is overplaying the efficacy in their soundbites by shoving that 70% front and center, but its detractors are overplaying the gap between the 70% and the real story by shoving the 14% front and center.

I'm glad I'm a biostatistician, else I wouldn't be able to understand all this jockeying of the numbers.

Simple, but so complex

So, in addition to statistics, I've been dabbling a little in fractal/chaos theory. Nothing serious, but enough to know that behind even the simplest functions there lies an amazingly complex landscape. Who knew that z²+c could be so rich?

At any rate, I did all this stuff back in college, but in specializing I've forgotten most of it (except for the occasional admonition that it's often easy to confuse the complexity of dynamic systems for noise of a stochastic [random] system).

As I get older, it's become easier to lose the wonder. However, beneath every simple surface could be a world of complexity that will inspire a new round of curiosity.

Tuesday, April 3, 2007

Regulatory fallout from TeGenero's ill-fated TGN1412 trial

While biostatistics does not get used very much in early human clinical trials, any regulatory changes can have an effect on the practice. The EMEA has published new guidelines (pdf - in draft form, to be finalized after public comment and consultation with industry) about the conduct of Phase I trials for "high-risk" compounds. This comes in the wake of the infamous TGN1412 trial, in which a monoclonal antibody caused severe adverse reactions in all of the six otherwise healthy trial participants. (All six participants suffered multiple organ failure, along with gangrene. They will all probably contract and die of cancer within a few short years.)

The EMEA concluded that the trial was conducted in accordance with current regulations. These new recommendations are changes to avoid another similar disaster.

Among the recommendations:
- stronger pre-clinical data, and a stronger association between pre-clinical data and choice of dosing in humans (e.g. using minimal dose for biological activity), as opposed to the no observed adverse-event dose
- the use of independent data safety monitoring boards, along with well-defined stopping rules for subjects, cohorts, and trials
- well-defined provisions for dose-escalation
- increasing follow-up length for safety monitoring
- use of sites with appropriate medical facilities

(via Thomson Centerwatch)

Wednesday, March 28, 2007

A final word on Number Needed to Treat

In my previous post in this series I discussed how to create confidence intervals for the Number Needed to Treat (NNT). I just left it as taking the reciprocal of the confidence limits of the absolute risk reduction. I tried to find a better way, but I suppose there's a reason that we have a rather unsatisfactory method as a standard practice. The delta method doesn't work very well, and I suppose methods based on higher-order Taylor series will not work much better.

So, what happens if the treatment has no statistically significant effect (the sample size is too small or the treatment simply doesn't work)? The confidence interval for absolute risk reduction will cover 0, say, -2.5% to 5%. Taking reciprocals, you get an apparent NNT confidence interval of -40 to 20. A negative NNT is easy enough to interpret: an NNT of -40 means that for every 40 people you "treat" with the failed treatment, you get a reduction of 1 in favorable outcomes. A 0 absolute risk reduction results in NNT=∞. So if the confidence interval of absolute risk reduction covers 0, the confidence interval of the NNT must cover ∞. In fact, in the example above, we get the bizarre confidence set of -∞ to -40 and 20 to ∞, NOT -40 to 20. The interpretation of this confidence set (it's no longer an interval) is that either you have to treat at least 20 people, but probably a lot more, to help one, or for every 40 or more people you treat you might harm one.

For this reason, for a treatment that doesn't reach statistical significance (i.e. whose confidence interval for absolute risk reduction includes 0), the NNT is often reported as a point estimate only. I would argue that such a point estimate is meaningless. In fact, if it were left up to me, I would not report an NNT for a treatment that doesn't reach statistical significance, because the interpretation of statistical non-significance is that you can't prove with the data you have that the treatment helps anybody.
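The reciprocal rule, including the awkward case where the risk-reduction interval covers 0, can be sketched as follows. The function name and the list-of-intervals return format are choices made for this illustration, not a standard API.

```python
import math

def nnt_confidence_set(arr_lower, arr_upper):
    """Turn confidence limits for absolute risk reduction into NNT limits.

    Returns a list of (low, high) intervals. If the risk-reduction interval
    contains 0, the result is made of pieces that run off to infinity.
    """
    if arr_lower > arr_upper:
        raise ValueError("arr_lower must not exceed arr_upper")
    if arr_lower > 0 or arr_upper < 0:
        # Both limits have the same sign: take reciprocals and swap the order.
        return [(1 / arr_upper, 1 / arr_lower)]
    pieces = []
    if arr_lower < 0:
        pieces.append((-math.inf, 1 / arr_lower))  # harm side
    if arr_upper > 0:
        pieces.append((1 / arr_upper, math.inf))   # benefit side
    return pieces

print(nnt_confidence_set(-0.025, 0.05))  # interval covers 0: two disjoint pieces
print(nnt_confidence_set(0.02, 0.08))    # significant benefit: one ordinary interval
```

The first call uses the -2.5% to 5% example from the text; the second uses hypothetical limits for a treatment that does reach significance.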

Douglas Altman, heavy hitter in medical statistics, has the gory details.

Wednesday, March 21, 2007

SAS weirdness

From time to time, I'll complain about the weirdness of SAS, the statistical analysis program of choice for much of the pharmaceutical industry. This post is one such complaint.

Why, oh why, does SAS not directly give us the asymptotic variance of the Mantel-Haenszel odds ratio estimate? It does, however, give the confidence interval. Though the default is a 95% confidence interval, you can specify alpha=0.317 in the TABLES statement in the FREQ procedure (so that the normal critical value is approximately 1) and use ODS output to get these values into a dataset. Then either dividing the upper confidence limit by the Mantel-Haenszel odds ratio estimate, or dividing the MH estimate by the lower confidence limit, gives exp(standard error) of the log odds ratio (both should give the same answer); take the log and square it to get the asymptotic variance. The point is, SAS has to compute the asymptotic variance to calculate the confidence interval, so why not just go ahead and display it? (Yes, I understand that the confidence interval is symmetric only on a log scale.)
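The same back-calculation works at any confidence level once the limits are out of the software. Here is a rough sketch in Python, assuming the reported limits are of the form exp(log(OR) ± z·SE), i.e. symmetric on the log scale. The function name and the example limits are hypothetical, not output from any real analysis.

```python
from math import log
from statistics import NormalDist

def log_or_variance_from_ci(lower, upper, alpha=0.05):
    """Asymptotic variance of log(odds ratio) recovered from its confidence limits.

    Assumes the limits were built as exp(log(OR) +/- z * SE).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)       # normal critical value
    se = (log(upper) - log(lower)) / (2 * z)      # half-width on the log scale, divided by z
    return se ** 2

# Hypothetical output: 95% limits of 1.10 and 2.45 for a Mantel-Haenszel odds ratio.
print(log_or_variance_from_ci(1.10, 2.45))
```

Using both limits rather than one limit and the point estimate avoids depending on how many digits of the estimate were printed.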

Addendum: R doesn't either. Same story. Weird.

Tuesday, February 13, 2007

Alternative medicine use might be affecting the results of trials

From here.

At issue is whether remedy-drug interactions are skewing the results of Phase I cancer trials. At present, this is hard to determine because it is hard to elicit (alternative) remedy use. To me, it's pretty clear that having remedy use out in the open is better than being secretive. However, what's interesting to me is the following, surveyed from 212 patients with advanced cancer enrolled in Phase I clinical trials:

72 (34 percent) use alternative remedies, similar to general US population usage
41 (19.3 percent) take vitamins and minerals
40 (18.9 percent) take herbal preparations

In addition, we have the following:

Sometimes, patients are reluctant to tell the doctor they are taking
alternative medicines, either because they don't think it's important,
or they don't want to be told to stop taking them, Daugherty said.

Also,

And, since it's often difficult to get cancer patients to take part in
phase 1 trials, some researchers may be reluctant to turn any potential
patient away. "In addition, most doctors don't know very much about
alternative medicine," Daugherty said.

For research, I think these matters need to be out in the open. For one thing, we need to understand the drugs we are developing. For another, we need to understand the alternative remedies.

Friday, February 9, 2007

Big news that's easy to miss -- FDA clears a molecular prognostic tool for breast cancer metastasis

The FDA just approved a prognostic device. So what's the big deal?

I'll let the press release speak:

It is the first cleared molecular test that profiles genetic activity.

That's right. Despite many years (ok, about a decade to a decade and a half) of use in basic research, microarray technology has matured (along with the analysis methodologies) enough to be used in clinical practice, and this approval marks a big step toward that. Microarrays were a buzz in the statistical community a few years ago when there were still some methodological hurdles to overcome (and were being rapidly overcome).

What's more, this marks the first time a genetic expression test (different from a genetic test -- this one identifies which genes are expressed [active] at a particular time) has been approved for the prognosis of a disease. 70 gene expressions are analyzed. Agendia has blazed some trails, and I expect to see more of this kind of test in the coming years. And that's a good thing.

And, hopefully for women with breast cancer, this will help a bit in the decision making for treatment.

Tuesday, February 6, 2007

Dichloroacetate (DCA) - not an easy road to "cheap, patentless cancer cure"

Ginger, curcumin, and DCA have recently been touted for their anticancer properties, and, granted, in the Petri dish, they look pretty good. However, it's a long road from the Petri dish to the pharmacy shelves. Abel Pharmboy's coverage of DCA seems to be the most spot-on, and his points are worth repeating. Many compounds show promise in the Petri dish and animal models, but when it comes to human trials, they bomb. It is entirely possible, in fact, from drug development experience I would say very likely, that DCA will do very little for humans in trials. It may be ineffective when we actually inject it into human beings (for anticancer purposes; it's already approved for some metabolic disorders).

Remember, cancer is a complex disease. To "cure cancer" is really to cure a whole lot of different diseases, which is why our "war on cancer" applies some rather naive assumptions.

I'm all for supporting DCA research, and, unlike some of the more paranoid commenters on this issue, I think that Pharma companies are taking notice. It's not unthinkable that some NIH oncology funding is in the pipeline for the compound, or some small pharma company will in-license the compound/formulation/use and perhaps even bring some of Big Pharma's research dollars in if the compound passes Phase II trials (Phase I safety testing should be a breeze relatively speaking, since it's already approved for some indications). Before we get up in arms about what is patentable and whether research dollars will be spent on a promising compound, realize that nearly everybody in this industry is for helping others, and we will find a way to get the most promising compounds studied and, if they work, on the market.

Tuesday, January 2, 2007

Confidence limits on NNTs - a guide to comparing NNTs

Previously in this series I discussed the definition of the NNT (i.e. when comparing therapy to placebo, it's 1/absolute risk reduction of therapy) and how to interpret it (it's the expected number of people that you would have to treat to prevent one unfavorable outcome). Its evil twin, the NNH, is similarly calculated and interpreted in association with an adverse event.

In my first entry on the subject, a commenter asked whether the NNT is a number or a statistic with an error. The answer is that it's a statistic with an error. Problem is, most people do not report the error (or confidence interval) along with the NNT. The error or confidence interval helps us answer questions such as "Drug A has NNT of 21 and Drug B has NNT of 22. Is there really a difference?"

To begin with, I'll assume that all needed data is available in this form:

Risk of unfavorable outcome for placebo group
Risk of unfavorable outcome for treatment group
Sample size

To calculate the NNT itself, just do 1/(risk in placebo-risk in treatment). The sample size is not needed. However, to calculate the error, the sample size is necessary.

Then, we calculate the error for (risk in placebo-risk in treatment) (i.e. 1/NNT). The expression is a little more complicated, but not hard to put into a spreadsheet or calculator:

std error = sqrt(risk placebo * (1 - risk placebo) / (# in placebo group) + risk treatment * (1 - risk treatment) / (# in treatment group)),

where sqrt means take the square root of the whole thing. A simple explanation goes as follows:

risk group A * (1 - risk group A) / (# in group A)

is the variance of the estimate in group A. Add the variances in the placebo and treatment groups to get the variance of the difference (the risk reduction), and then take the square root to get the error. So two principles: variances (often) add, and the error is the square root of the variance.

To get a 95% confidence interval of the risk reduction, you take the difference and add/subtract 2 times the error¹.
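For those who would rather script it than use a spreadsheet, the recipe above translates directly into a short function. This is a sketch of the same back-of-the-envelope interval; the function name and the example inputs are hypothetical.

```python
from math import sqrt

def risk_reduction_ci(risk_placebo, n_placebo, risk_treatment, n_treatment, multiplier=2.0):
    """Risk reduction, its standard error, and an approximate 95% interval."""
    reduction = risk_placebo - risk_treatment
    # Variances of the two group estimates add
    variance = (risk_placebo * (1 - risk_placebo) / n_placebo
                + risk_treatment * (1 - risk_treatment) / n_treatment)
    error = sqrt(variance)
    return reduction, error, reduction - multiplier * error, reduction + multiplier * error

# Hypothetical inputs: 30% risk on placebo, 20% on treatment, 500 subjects per group.
reduction, error, lower, upper = risk_reduction_ci(0.30, 500, 0.20, 500)
print(f"risk reduction {reduction:.1%}, error {error:.1%}, interval {lower:.1%} to {upper:.1%}")
```

Passing `multiplier=1.96` instead of 2 gives the textbook normal-approximation interval.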

Example. In the last entry I compared niacin and simvastatin. The article has some of the information we need:

Drug	Risk placebo	Risk treatment	Sample size
Niacin	36.5%	31.5%	1390
Simvastatin	21.5%	13.5%	2221 (sim) 2223 (placebo)

I had to do some sleuthing to get the sample size numbers. For niacin a Google search for the Coronary Drug Project landed this draft of a report, from which I found a total sample size and divided by 6 (there were six groups). For the simvastatin number I used the Wikipedia entry on Scandinavian Simvastatin Survival Study to get the sample size. But anyway, we're able to do a confidence interval calculation. We start with niacin:

risk reduction = (36.5% - 31.5%) = 5% (so NNT = 1/5% = 20)
error = sqrt(0.365 * 0.635/1390 + 0.315*0.685/1390) = 0.018 = 1.8%
95% confidence interval of risk reduction is 5%-2*1.8% to 5%+2*1.8% = 1.4% to 8.6%

The 95% confidence interval for risk reduction for simvastatin is 5.8% to 10.2%. (I got a risk reduction of 8% and standard error of 1.1%.)

End example.

The simplest way to get a 95% confidence interval for the NNT is to just do 1/confidence limits. You will also have to invert the order of the limits. Granted, this isn't necessarily the best way, and I'll probably show how to do it another way one day, but it's easy and these are actual (approximate) 95% confidence limits. So the NNT limits for niacin are (1/8.6% to 1/1.4%) = (11.6 to 71.4). The NNT limits for simvastatin are (9.8 to 17.2).

From this quick and dirty calculation, it's not absolutely clear that simvastatin has higher efficacy than niacin, because the two intervals overlap. Part of the reason for this is the wide error in the risk reduction estimate for niacin, which comes from the smaller sample size and from risks that are closer to 50% (31.5% of subjects in the niacin group of the study had a cardiovascular event).
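The whole comparison can be recomputed from the table's inputs in a few lines. This sketch uses the same multiplier of 2 and the sample sizes from the table; the function and variable names are made up for illustration. Because it carries the unrounded standard error through, its limits will differ a little from hand calculations done with rounded percentages.

```python
from math import sqrt

def nnt_with_limits(risk_placebo, n_placebo, risk_treatment, n_treatment, multiplier=2.0):
    """NNT point estimate and approximate 95% limits, obtained by inverting
    the risk-reduction limits. Only valid when both limits are positive."""
    reduction = risk_placebo - risk_treatment
    error = sqrt(risk_placebo * (1 - risk_placebo) / n_placebo
                 + risk_treatment * (1 - risk_treatment) / n_treatment)
    lower, upper = reduction - multiplier * error, reduction + multiplier * error
    if lower <= 0:
        raise ValueError("risk-reduction interval covers 0; NNT limits are not an interval")
    # Reciprocals reverse the order of the limits
    return 1 / reduction, 1 / upper, 1 / lower

# (risk placebo, n placebo, risk treatment, n treatment) as given in the table
trials = {
    "niacin": (0.365, 1390, 0.315, 1390),
    "simvastatin": (0.215, 2223, 0.135, 2221),
}
results = {drug: nnt_with_limits(*inputs) for drug, inputs in trials.items()}
for drug, (nnt, low, high) in results.items():
    print(f"{drug}: NNT {nnt:.1f}, approximate 95% limits {low:.1f} to {high:.1f}")

# Overlapping intervals mean the data do not clearly separate the two drugs.
overlap = (results["niacin"][1] <= results["simvastatin"][2]
           and results["simvastatin"][1] <= results["niacin"][2])
print("intervals overlap:", overlap)
```

An overlap check like this is a crude comparison, not a formal test, and it inherits every caveat about comparing numbers across different studies.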

A few other issues are worth pointing out here, and they cloud the issue even more. I took these numbers from two different studies: the 4S study and the Coronary Drug Project. If you look at these two studies, they have different inclusion criteria (e.g. the Coronary Drug Project had an inclusion criterion of men only). Eventually, in trying to get the information we need, we come across such barriers. Preferably, the numbers I used above would have come from the same study, and given the differences between objectives and populations in the studies, the comparison between the simvastatin and niacin NNTs is not as straightforward as the back-of-the-envelope calculations above can lead you to believe. It's important to keep the limitations of both the data and the statistics in mind.

¹Technical details: this is an approximate interval, and those who have been through stats classes may prefer to use z_0.025=1.96. I don't think it matters too much except in cases where accuracy is very important, such as academic reports and regulatory submissions. Also, this confidence interval has fallen out of favor with statisticians, but is easy and useful for the kinds of back-of-the-envelope things I'm doing here. [back]

Monday, January 1, 2007

Not quite a repost - Number Needed to Treat (NNT)

Rather than repost this entry on the NNT, I thought I would discuss the issue a little further. For background, here are some references:

For the individual, the NNT doesn't really matter. After all, when you take a drug, it doesn't matter what happens to other people on the drug. It only matters what happens to you. However, for public policy makers and insurance companies, the NNT has become very important. The reasoning goes as follows:

Say you, as an insurance company, wanted to compare two therapies to prevent serious cardiovascular events (e.g. cardiovascular death, myocardial infarction): niacin and simvastatin. The cost of a cardiac event is high both economically (in terms of health care cost and days missed) and in pain and suffering. Then we can answer the following question: what is the cost of preventing one event using niacin and simvastatin?

You can organize the work as follows:

Therapy	Source	Duration (years)	Cost/Day	NNT	Total Cost/1 prevention
Niacin	Coronary Drug Project	6.2	$0.21	20	$9,511.11
Simvastatin	4S	5.4	$0.93	13	$23,845.71

The only calculated column is the last one (you can easily set this kind of table up in any spreadsheet). The calculation is total=duration*365.25*cost/day*NNT. In addition, I used a favorable cost for simvastatin (although it will get more favorable when more generics hit the market) and an unfavorable cost for niacin. Source of data is Tables 1 and 2 from Therapeutics Letter, May 1998 as shown here. Note that cerivastatin has been withdrawn since then, and a generic form of simvastatin has hit the market.

You can interpret the last column as follows: to prevent one cardiovascular event, you expect to pay $9,511.11 for niacin therapy or $23,845.71 for simvastatin therapy.
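If you prefer a script to a spreadsheet, the last column can be reproduced like this. The inputs are the table's own numbers; the function and variable names are just illustrative.

```python
DAYS_PER_YEAR = 365.25

def cost_per_prevention(duration_years, cost_per_day, nnt):
    """Expected drug cost to prevent one event: duration * 365.25 * cost/day * NNT."""
    return duration_years * DAYS_PER_YEAR * cost_per_day * nnt

# (duration in years, cost per day in dollars, NNT) from the table above
therapies = {
    "Niacin": (6.2, 0.21, 20),
    "Simvastatin": (5.4, 0.93, 13),
}
for name, (years, daily, nnt) in therapies.items():
    print(f"{name}: ${cost_per_prevention(years, daily, nnt):,.2f}")
```

Swapping in a different daily cost (say, a generic price) or an NNT confidence limit instead of the point estimate shows quickly how sensitive the comparison is to each input.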

Simvastatin actually doesn't compare too unfavorably with niacin therapy. Other measures such as safety profile (including NNH - Number Needed to Harm - for the more serious adverse events) are needed in the decision making process, but under this measure we should not rule out simvastatin as a valid and effective therapy. The main issue is unit cost, something that will change as generics come on the market or something that can be negotiated down (especially in the case of transporting drugs to developing countries).

Speaking of safety profile, that is something that hasn't been figured into the table above. The costs associated with flushing, gastrointestinal problems, skin problems, and acute gout (all associated with niacin), along with the NNH for these issues, need to be addressed. On the simvastatin side, creatine kinase elevation with muscle weakness and rhabdomyolysis need to be addressed, along with the other adverse events that have been found to be associated with statin therapy in the last couple of years. These expected costs are to be added to the costs in the table above.

There are a few caveats to the NNT, a couple of which I mention below:

First, the NNT is a number derived from an estimate, i.e. 1/absolute risk reduction. Though most estimates are reported with a standard error, NNTs are not (and this is a flaw). Likewise, the costs above have a range deriving from several sources: range of cost, range of NNT, range of durations studied.
Second, the NNT is based on population-based statistics. For an individual making an individual decision about healthcare, it carries less weight than it would for an insurance company deciding which therapies to cover or a healthcare NGO deciding which therapies to pay for transport into developing countries. Side effect risk factors, metabolism profile, and other individual factors carry more weight (and those carry less, but higher than zero, weight with policy makers).

So next on my plate in the NNT series is how to compute a standard error from measures that you see in the literature.