Realizations in Biostatistics: Reproducible research in the drug development industry

1 What is reproducible research?

Reproducible research, in a nutshell, is the process of publishing
research in such a way that a person can pick up the materials and
reproduce the research exactly. This is an ideal in
science. Essentially, all data, programming code, and interpretation
are presented in such a way that it is easy to see what was done, how
it was done, and why.

A report written in reproducible research style is written in such a
way that any result that comes from analyzing data is written in some
programming language inside the report. The written report is then
processed by software that will interpret the programming code and
replace it with both the code and the output from the code. The reader
of the report then sees exactly what code is executed to produce the
results, and the results that are shown in the report are guaranteed
to be from the code that is shown. This is different from, for
example, writing the code in the document and running it separately to
generate results which are copied and pasted back into the report. In
essence, the report and the analysis are done together, at the same
time, as a unit. A demo of how this works using the LaTeX, Sweave,
and R packages can be found here, and another example using R and
LaTeX, but not Sweave, can be found at Frank Harrell's rreport page.
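
The weaving step described above can be illustrated with a toy processor. This is only a minimal sketch in Python, not Sweave itself: the chunk markers (a line "<<>>=" to open a chunk and a line "@" to close it) are loosely modeled on noweb syntax, and the function name weave and the sample report with its made-up numbers are assumptions for illustration.

```python
import contextlib
import io
import re

# Hypothetical chunk syntax, loosely modeled on noweb: a line "<<>>=" opens
# a code chunk and a line "@" closes it.
CHUNK = re.compile(r"^<<>>=\n(.*?)^@\n", re.DOTALL | re.MULTILINE)


def weave(source):
    """Replace each code chunk with the code followed by its printed output."""
    namespace = {}  # shared, so later chunks can reuse earlier results

    def run_chunk(match):
        code = match.group(1)
        buffer = io.StringIO()
        # Run the chunk and capture whatever it prints.
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        # Show the code (prefixed with "> ") and then the captured output.
        shown = "".join("> " + line + "\n" for line in code.splitlines())
        return shown + buffer.getvalue()

    return CHUNK.sub(run_chunk, source)


# Made-up report text and numbers, for illustration only.
REPORT = """The mean of the example values is computed in the report itself:
<<>>=
values = [120, 130, 125]
print(sum(values) / len(values))
@
End of report.
"""

if __name__ == "__main__":
    print(weave(REPORT), end="")
```

In the woven text, the chunk is replaced by its two lines of code, each prefixed with "> ", followed by the printed result 125.0. Because the output is produced at the moment the report is processed, it cannot drift out of step with the code that is shown, which is the guarantee that copy and paste cannot give.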

Further information can be found at some of the links below (and the
links from those pages).

2 Challenges in doing reproducible research in industry

In some sense, drug development in the USA requires the highest
standard of reproducible research, simply because the reviewers at the
Food and Drug Administration want to reproduce all of a sponsor's
analyses themselves. A lot of effort has gone into this, including
standardization through the Clinical Data Interchange Standards
Consortium (CDISC), the common technical document (CTD) format and its
electronic counterpart, and an emphasis on documentation.

Despite these efforts, I still think we fall short of the ideals of the
literate programming and reproducible research movements. For example,
even if SAS programs are sent along with a study report, it is not
easy to tell which SAS program generates which display. In fact,
sometimes, displays in the body of the study report (in-text tables)
are copied/pasted and rearranged, further breaking the association
between the summary and the data. Even worse, sometimes in-text tables
are tabulated by hand, which is prone to untraceable mistakes.
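
As a rough illustration of what keeping the association between a display and its program could look like outside a weave system, the sketch below records, for each display, the program and data that produced it along with a fingerprint of each. This is not something SAS or any submission standard provides in this form; the function names, the table number, and the file names are all hypothetical.

```python
import hashlib


def sha256_hex(content):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(content).hexdigest()


def manifest_entry(display_id, program_name, program_source, data_name, data_bytes):
    """Record which program and data produced a display, with fingerprints."""
    return {
        "display": display_id,
        "program": program_name,
        "program_sha256": sha256_hex(program_source),
        "data": data_name,
        "data_sha256": sha256_hex(data_bytes),
    }


def is_current(entry, program_source, data_bytes):
    """True if neither the program nor the data changed since the entry was made."""
    return (
        entry["program_sha256"] == sha256_hex(program_source)
        and entry["data_sha256"] == sha256_hex(data_bytes)
    )


# Hypothetical names and contents, for illustration only.
PROGRAM = b"proc means data=adsl; var age; class trt; run;\n"
DATA = b"subject,trt,age\n001,A,54\n002,B,61\n"

entry = manifest_entry("Table 14.1.1", "t_demog.sas", PROGRAM, "adsl.csv", DATA)

if __name__ == "__main__":
    print(entry["display"], "<-", entry["program"], "on", entry["data"])
    print("still current:", is_current(entry, PROGRAM, DATA))
```

A manifest like this only documents the link after the fact. It does nothing for an in-text table that was copied, rearranged, or tabulated by hand, which is why the weave approach is the stronger guarantee.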

It is, of course, going to be technically difficult for industry to
fully embrace the principles of reproducible research. The
reproducible research movement has embraced open-source tools such as
LaTeX and R, while much of industry has embraced commercial tools such
as SAS and Microsoft Word. A few years ago, Russ Lenth
demonstrated a working version of SASweave at the Joint Statistical
Meetings, so the reliance on SAS is not such a problem. However,
reliance on Microsoft Word is a larger problem, both because it is
difficult to include dynamic output from statistical packages and
because Word encourages mixing formatting with static text. The odfWeave
package in R attempts to overcome this difficulty with the Writer
package from Open Office, but I have had limited success using this
strategy. It is rather difficult to set up and does not seem to work
entirely as advertised on Windows.

Microsoft Word can include some dynamic data through the use of
linking. For example, a graph can be included in the document in such
a way that if the source image changes, the change is reflected in the
document as well. However, if the file is sent via email or secure
transfer to a collaborator or reviewer, the links no longer resolve and
the linked content does not show up properly. To avoid this, the links
have to be converted to static content manually (or, perhaps, through a
Visual Basic macro) before sending. Even then, if the reviewer
returns the document with comments or tracked changes, there are two
versions of the document: one with linked dynamic content and another
with comments and edits. Reconciling these documents can be a
challenge.

The LaTeX package is advantageous for these major reasons:

- It supports the separation of content and formatting so the author
  can focus solely on the content.
- It is a free and mature package which can be installed on all major
  computer platforms.
- It is plain text, which supports the inclusion of dynamic content
  through the weave packages discussed above.
- LaTeX can easily include files with LaTeX commands. This makes it
  possible, for example, to create a master document and subdocuments
  that contain different sections of a study report. Different people
  can be responsible for the subdocuments without stepping on each
  other's toes. (Microsoft Word can do this as well, but it seems to
  suffer from the same drawbacks as listed above concerning dynamic
  content.)

The major disadvantages are as follows:

- Adoption would represent a major paradigm shift for most companies
  because they have already invested heavily in Microsoft Office
  software and training.
- The LaTeX learning curve is rather steep.
- LaTeX does not seem to have the redlining and comment features of
  Microsoft Word, although most free and mature version control
  systems, such as CVS and Subversion, have sophisticated history
  tracking that would render a redline feature unnecessary.
  Distributing the document as a portable document format (PDF) file
  could enable the use of the Adobe Acrobat commenting features, but
  this has drawbacks as well because comments from multiple reviewers
  would be harder to consolidate. Other strategies are listed at a
  StackOverflow page, but seem to be difficult to implement across an
  enterprise.
- Implementing some of these systems might be a challenge in light of
  the validation requirements of 21 CFR 11 and the various
  interpretations of them. It is well recognized that implementing a
  21 CFR 11 compliant system for statistical production is a very
  difficult exercise, and few systems comply completely.

The most difficult change, however, may have to do with how industry
thinks about reporting. We are used to the Microsoft Word approach to
reporting, with copying and pasting, editing format and content
concurrently, and passing redlined and commented files around via
email or secure server. Moving toward reproducible research will
necessarily require a shift in many of these habits because they do
not support the automated updating that reproducible research
requires.

3 Why we should overcome those challenges

In a nutshell, the why of implementing reproducible research comes
down to the following:

- saving time and money in the drug development and review process
- presenting research in such a way as to maximize the advancement of
  science

The first reason, which I consider the business case for reproducible
research, comes from making the process of what we already have to do
(present data to the FDA in an open and transparent way) more
efficient. For example, if the FDA reviewer wants to reproduce a table
found in a study report, he or she will have to track down the
original data, see if there is a program that generates the table, and,
if not, try to reproduce the table via trial and error. Given that
statistical procedures require a lot of small decisions along the way,
this can be nearly impossible. I've found that trying to reproduce
someone else's results is one of the most difficult and least
desirable tasks in statistics, especially when the producer of the
original analysis is unavailable or uncooperative. It is also one of
the most time-consuming tasks. In a reproducible research situation,
the code is available in the document, and it is guaranteed that the
code shown in the document is that which produces the display.

The second reason, the scientific case, has to do with the fact that a
lot of effort can go into reproducing results, and that time can be
better spent extending results. With the code to reproduce a result
put next to the result, scientific consumers of the data can think of
new ways of considering the data. In the drug development context,
this is usually the sponsor or investigator, who might, for example,
mine the data for potential safety issues. This may prevent these
issues from being discovered for the first time during postmarketing
or marketing application review. The FDA reviewer might be able to
spend less effort on reproducing results, potentially shaving that time
off of the review.

4 How to overcome those challenges

I don't have any one solution to the above problems, and in fact the
right solution is probably going to differ among organizations due to
different investments, culture, and expertise of the staff.

The first solution is to implement the free and open-source packages
of LaTeX, R, Sweave (included with R), and a version control
system. Frank Harrell at Vanderbilt University demonstrates this
system in the context of a data monitoring committee report.
Alternatively, SAS can be used, but because I do not know the
maturity of Lenth's SASweave project, I cannot vouch for it.
REvolution Analytics's Enterprise R product can be used to assist with
the validation requirements and even some of the training, but it is
important not to underestimate the value already built up in SAS
training if an organization has been using SAS for years. (It is also
important to recognize the technical advantages SAS has over R.)

Another alternative is to use the odfWeave package. This is
attractive because the investment in Microsoft Word training can be
largely carried over to Open Office. This option requires the use of
R, because I have not seen a SAS odfWeave product. However,
programming this is not out of the question.

Another option I have not explored is an initiative by SAS to exchange information with Microsoft Office. This seems to be oriented toward a
business intelligence audience, and I have not explored whether it
would be appropriate in a clinical trial reporting situation. An
obvious advantage to this solution is that the software tools in which
most pharmaceutical and healthcare research companies have already
invested licensing and training fees will continue to be
used. However, I do not know how easy this will be to implement in a
clinical trial reporting setting or if it adheres to the principles of
reproducible research.

Another possible approach is the use of extensible markup language
(XML) standards, such as those hosted by the OASIS consortium. While I
have not explored these ideas fully, I have heard of one group
producing table shells (i.e. table layouts with placeholder
information) that are reused at the final table production
phase. Though that example is not quite reproducible research, the
fact that XML is based mostly on text makes it possible to convert it
easily to a web page, PDF document, or even RTF, and a weave package
for R or SAS could easily be written in much the same way as Sweave or
SASweave. I don't know of any implementation of these solutions, so
something based on these standards would have to be developed either
in-house (expensive) or externally (slowing adoption).
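
To make the table shell idea concrete, here is a minimal sketch in Python of filling a shell's placeholders at the final table production phase. The element and attribute names (table, row, cell, key) are invented for illustration and do not follow any OASIS or CDISC schema, and the result values are made up. It is a sketch of the general approach, not the implementation the group mentioned above used.

```python
import xml.etree.ElementTree as ET

# Hypothetical table shell: the layout is fixed up front and every cell
# holds a placeholder ("xx" or "xx.x") until results are available.
SHELL = """<table title="Summary of age by treatment arm">
  <row label="N"><cell key="n_placebo">xx</cell><cell key="n_active">xx</cell></row>
  <row label="Mean"><cell key="mean_placebo">xx.x</cell><cell key="mean_active">xx.x</cell></row>
</table>"""

# Made-up results, keyed by the placeholder names used in the shell.
RESULTS = {
    "n_placebo": "25",
    "n_active": "24",
    "mean_placebo": "51.3",
    "mean_active": "49.8",
}


def fill_shell(shell_xml, results):
    """Parse the shell and replace every placeholder with its result."""
    root = ET.fromstring(shell_xml)
    for cell in root.iter("cell"):
        key = cell.get("key")
        if key not in results:
            # Fail loudly rather than leave a placeholder in a final table.
            raise KeyError(f"no result supplied for placeholder {key!r}")
        cell.text = results[key]
    return root


def to_text(root):
    """Render the filled table as plain text, one line per row."""
    lines = [root.get("title")]
    for row in root.iter("row"):
        cells = [cell.text for cell in row.iter("cell")]
        lines.append(row.get("label").ljust(6) + "".join(c.rjust(8) for c in cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_text(fill_shell(SHELL, RESULTS)))
```

Because the shell is plain text, the same filled tree could be rendered to a web page, PDF, or RTF by swapping out to_text, and a weave-style tool would generate the results dictionary from analysis code rather than from hand-entered values.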

Finally, since Office 2007, Microsoft has moved (somewhat) toward
implementing open standards (e.g. XML) for its Office products. This
makes it possible, for example, for a WordWeave solution to be
written. Again, I don't know of an existing solution, so this route
would either lead to slow adoption or heavy investment.

In a clinical research organization (CRO) or larger pharmaceutical
company setting, it may be possible to select a small project in which
these ideas (those that do not require new software to be developed)
can be implemented. Those who are most familiar with the tools can be
selected to be on the project, and a computer can be dedicated
temporarily to housing the tools. A rough, abbreviated validation plan
can be implemented to ensure basic quality measures are set up. If the
CRO has a project that is not subject to Good Clinical Practice (GCP)
requirements, it will be perfect for such an effort. Implementing in a
small project is probably the best
approach because the project team can evaluate the new process and
identify the major organization-specific challenges in implementing
the changes across all projects. The major issue is going to be
sharing documents with the sponsor, vendors, and third-party
consultants. PDF with comment ability may be the best standard to use.

5 Conclusion

Reproducible research enables producers and consumers of clinical
trial data (usually, sponsors and the FDA) to quickly trace the flow
of data in such a way that reports are easily reproduced. The closer
we get to this ideal, the more quickly the FDA can review data on a new
product, leading to either a quicker approval or quicker
identification of deficiencies in an application. Moving toward this
ideal will usually require a heavy investment of resources, but may be
worth it if products are approved more quickly.