Tuesday, December 27, 2011

Lots of open education resources for your gadgetry

While not related exclusively to statistics, this resource does relate to open education. This page at OpenCulture.com gives a large list of resources of books, courses, and media you can use to fill your new (or old) gadget. They have a list of free online courses as well, with statistics falling under computer science, engineering, and mathematics.

Sounds like Christmas and New Year’s resolution wrapped up into one nice gift!

Thursday, December 22, 2011

A statistician’s view of Stanford’s open Introduction to Databases class

In addition to the Introduction to Machine Learning class (which I have reviewed), I took a class on introduction to databases, taught by Prof. Jennifer Widom. This class consisted of video lectures, review questions, and exercises. Topics covered included XML (markup, DTDs, schema, XPath, XQuery, and XSLT) and relational databases (relational algebra, database normalization, SQL, constraints, triggers, views, authentication, online analytical processing, and recursion). At the end we had a quick lesson on NoSQL systems just to introduce the topic and discuss where they are appropriate.

This class was different in structure from the Machine Learning class in two ways: there were two exams and the potential for a statement of accomplishment.

I think any practicing statistician should learn at least the material on relational databases, because data storage and retrieval is such an important part of statistics today. Many different statistical packages now connect to relational databases through technologies like ODBC, and knowledge of SQL can enhance the use of these systems. For example, for subset analyses it is usually better to do subsetting with the database than pull all the data into the statistical analysis package and then perform the subset. In biostatistics, the data are usually collected in a database using an electronic data capture or paper-based system, which store the data in an Oracle database.

I found that I already use the material in this course, even though I typically don’t write SQL queries more complicated than a simple subset. Some of the examples for the exercises involved representing a social network, which may help when I do my own analyses of networks. Other examples were of relationships that you might find in biostatistics and other fields.

I found that by spending up to 5 hours a week on the class I was able to get a lot out of it. Unfortunately, Stanford is not offering this in the winter quarter, but they have promised to offer this again in the future. I heartily recommend it for anyone practicing statistics, and the price is right.

Tuesday, December 20, 2011

MIT launches online learning initiative

This may not relate directly to statistics, but it relate to my experiences in an online introductory machine learning class.

MIT has decided to launch online public classes of its own. It looks like they are making the platform open-source as well and encouraging other institutions to go online as well.

This may well be the most exciting development in education in years. Hopefully it won’t be too long before we see efforts to get hardware into disadvantaged neighborhoods or other countries (such as the one laptop per child project from a few years ago) combined with this online education.

For Stanford’s winter 2012 classes, check out Class Central.

Monday, December 19, 2011

A statistician’s view on Stanford’s public machine learning course

This past fall, I took Stanford’s class on machine learning. Overall, it was a terrific experience, and I’d like to share a few thoughts on it:

  • A lot of participants were concerned that it was a watered down version of Stanford’s CS229. And, in fact, the course was more limited in scope and more applied than the official Stanford class. However, I found this to be a strength. Because I was already familiar with most of the methods in the beginning (linear and multiple regression, logistic regression), I could focus more on the machine learning perspective that the class brought to these methods. This helped in later sections where I wasn’t so familiar with the methods.
  • The embedded review questions and the end of section review questions were very well done, with some randomization algorithm making it impossible to guess until everything was right.
  • Programming exercises were done in Octave, an open source Matlab-like programming environment. I really enjoyed doing this programming, because it meant I essentially programmed regression and logistic regression algorithms by hand with the exception of a numerical optimization algorithm. I got a huge confidence boost when I managed to get the backpropagation algorithm for neural networks correct. Emphasis on these exercises was on the loops, which you could code using “slow” loops (for loops, for instance), but then really needed to vectorize using the principles of linear algebra. For instance, there was an algorithm for a recommender system that would take hours if coded using for loops, but ran in minutes using a vectorized implementation. (This is because the implicit loops of vectorization were run using optimized linear algebra routines.) In statistics, we don’t always worry about implementation details so much, but in machine learning situations, implementation is important because these algorithms often need to run in real time.
  • The class encouraged me to look at the Kaggle competitions. I’m not doing terribly well in them, but now at least I’m hacking on some data myself and learning a lot in the process.
  • The structure of the public class helps a lot over, for example, the iTunes U version of the class. But now I’m looking at the CS 229 lectures on iTunes U and am understanding them a lot more now.
  • Kudos to Stanford for taking the lead on this effort. This is the next logical progression of distance education, and takes a lot of effort and time.

I also took the databases class, which was even more structured with a mid-term and final exam. This was a bit of a stretch for me, but learning about data storage and retrieval is a good complement to statistics and machine learning. I’ve coded a few complex SQL queries in my life, but this class really took my understanding of both XML-based and relational database systems to the next level.

Stanford is offering machine learning again, along with a gaggle of other classes. I recommend you check them out. (Find a list, for example, at the bottom of the page of Probabilistic Graph Models site.) (Note: Stanford does not offer official credit for these classes.)

Friday, December 16, 2011

It’s not every day a new statistical method is published in Science

I’ll have to check this out – Maximal Information-based Nonparametric Exploration (MINE - har har). The link to the paper in Science.

I haven’t looked at this very much yet. It appears to be a way of weeding out potential variable relationships for further exploration. Because it’s nonparametric, relationships don’t have to be linear, and spurious relationships are controlled with a false discovery rate method. A jar file and R file are both provided.

Monday, December 5, 2011

From datasets to algorithms in R

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:

  • You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
  • You can specify interactions and nesting.
  • You can automatically create a numerical matrix for a formula using model.matrix(formula).
  • Formulas can be updated through the update() function.

Recently, I wanted to create predictions via Bayesian model averaging method (bma library on CRAN), but did not see where the authors implemented it. However, it was very easy to create a function that does this:

predict.bic.glm <- function(bma.fit,new.data,inv.link=plogis) {
    # predict.bic.glm
    #  Purpose: predict new values from a bma fit with values in a new dataframe
    # Arguments:
    #  bma.fit - an object fit by bic.glm using the formula method
    #  new.data - a data frame, which must have variables with the same names as the independent
    #             variables as was specified in the formula of bma.fit
    #             (it does not need the dependent variable, and ignores it if present)
    #  inv.link - a vectorized function representing the inverse of the link function
    # Returns:
    #  a vector of length nrow(new.data) with the conditional probabilities of the independent
    #  variable being 1 or TRUE
    # TODO: make inv.link not be specified, but rather extracted from glm.family of bma.fit$call
    form <- formula(bma.fit$call$f)[-2] # extract formula from the call of the fit.bma, drop dep var
    des.matrix <- model.matrix(form,new.data)
    pred <- des.matrix %*% matrix(bma.fit$postmean,nc=1)
    pred <- inv.link(pred)

The first task of the function finds the formula that was used in the call of the bic.glm() call, and the [-2] subscripting removes the dependent variable. Then the model.matrix() function creates a matrix of predictors with the original function (minus dependent variable) and new data. The power here is that if I had interactions or transformations in the original call to bic.glm(), they are replicated automatically on the new data, without my having to parse it by hand. With a new design matrix and a vector of coefficients (in this case, the expectation of the coefficients over the posterior distributions of the models), it is easy to calculate the conditional probabilities.

In short, the formula interface makes it easy to translate from the use case terms (factors, variables, dependent variables, etc.) into linear algebra terms where algorithms are implemented. I’ve only scratched the surface here, but it is worth investing some time into formulas and their manipulation if you intend to implement any algorithms in R.


For a long time, I have relied on R-bloggers for new, interesting, arcane, and all around useful information related to R and statistics. Now my R-related material is appearing there.

If you use the R package at all, R-bloggers should be in your feed aggregator.

Tuesday, October 25, 2011

Named in Best Colleges top 50 statistics blogs of 2011!

Realizations in Biostatistics has been named in Best Colleges top 50 best statistics blogs of 2011!

The wide variety of content in this blog has been noted, and, yes, I do try to write about a lot of different aspects of statistics for technical and nontechnical audiences. In fact, my two most popular entries are O’Brien-Fleming Designs in Practice, intended for pharma professionals not versed in statistics, and My O’Brien-Fleming Design Is Not The Same as Your O’Brien-Fleming Design, which contains a technical issue I came across while trying to design one.

At any rate, there are several bloggers I really respect from that list—Ben Goldacre, Nathan Yau, John Sall, David Smith to name a few—and I feel very honored to be mentioned among them. Check out the list for some great statistics content!

Wednesday, September 14, 2011

Help! We need statistical leadership now! Part I: know your study

It’s time for statisticians to stand up and speak. This is a time where most scientific papers are “probably wrong,” and many of the reasons listed are statistical in nature. A recent paper in Nature Neuroscience noted a major statistical error in a disturbingly large number of papers. And a recent interview with Deborah Zarin, director of ClinicalTrials.gov, in Science revealed the very disturbing fact that many primary investigators and study statisticians did not understand their trial designs and the conclusions that can be drawn from them.

Recent focus on handling these problems have primary been concerned with financial conflicts of interest. Indeed, disclosure of financial conflicts of interest has only improved reporting of results. However, there are other sources of error that we have to consider.

A statistician responsible for a study has to be able to explain a study design and state what conclusions can be drawn from that design. I would prefer that we dig into that problem a little deeper and determine why this is occurring (and fix it!). I have a few hypotheses:

  • We are putting junior statisticians in positions of responsibility before they are experienced enough
  • Our emphasis on classical statistics fills a lot of our education, but is insufficient for current clinical trial needs involving adaptive trials, modern dose-finding, or comparison of interactions
  • The demand for statistical services is so high, and the supply so low, that statisticians are spread out too thin and simply don’t have the time to put in the sophisticated thought required for these studies
  • Statisticians feel hamstrung by the need to explain everything to their non-statistical colleagues and lack the common language, time, or concentration ability to do so effectively

I’ve certainly encountered all of these different situations.

Tuesday, September 13, 2011

The statistical significance of interactions

Nature Neuroscience recently pointed out a statistical error that has occurred over and over in science journals. Ben Goldacre explains the error in a little detail, and gives his cynical interpretation. Of course, I’ll apply Hanlon’s razor to the situation (unlike Ben), but I do want to explain the impact of these errors.

It’s easy to forget when you’re working with numbers that you are trying to explain what’s going on in the world. In biostatistics, you try to explain what’s going on in the human body. If you’re studying drugs, statistical errors affect people’s lives.

Where these studies are published is also important. Nature Neuroscience is a widely read journal, and a large number of articles in this publication commit the error. I wonder how many articles in therapeutic area journals make this mistake. These are the journals that affect day to day medical practice, and, if the Nature Neuroscience error rate holds, this is disturbing indeed.

Honestly, when I read these linked articles, I was dismayed but not surprised. We statisticians often give confusing advice on how to test for differences and probably overplay the meaning of statistically significant.

We statisticians have to assume the leadership position here in the design, analysis, and interpretation of statistical analysis.

Monday, September 12, 2011

R to Word, revisited

In a previous post (a long time ago) I discussed a way to get a R data frame into a Word table. The code in that entry was essentially a brute force way of wrapping R data in RTF code, but that RTF code was the bare minimum. There was no optimization of widths, or borders, or anything like that.
There are a couple of other ways I can think of:
  • Writing to CSV, then opening in MS Excel, then copying/pasting into Word (or even saving in another format)
  • Going through HTML to get to Word (for example, the R2HTML package)
  • Using a commercial solution such as Inference for R (has this been updated since 2009?!)
  • Using Statconn, which seems broken in the later versions of Windows and is not cross platform in any case
  • Going through LaTeX to RTF
I’ve started to look at the last option a bit more, mainly because LaTeX2RTF has become a more powerful program. Here’s my current workflow from R to RTF:
  • Use cat statements and the latex command from the Hmisc package (included with every R installation) or xtable package (downloadable from CRAN) to create a LaTeX document with the table.
  • Call the l2r command included with the LaTeX2RTF installation. (This requires a bit of setup, see below.)
  • Open up the result in Word and do any further manipulation.
The advantages to this approach are basically that you can tune the appearance of the table from R rather than through post-processing of the RTF file. The main disadvantage is that you have to do a lot of editing of the l2r.bat file (if you are on Windows) to point to the correct directories of MiKTeX, Imagemagick, and Ghostscript. (And you have to make sure everything is installed correctly as well.) It’s rather tricky because you have to include quotes in the right places. However, that setup only has to occur once. The second disadvantage is that if you use the system() command to call l2r.bat, you have to use a strange syntax, such as system(paste('"c:\\Program Files\\latex2rtf\\l2r"',"c:\\path\\to\\routput.tex")). I suppose you could wrap this up in a neat function. At any rate, I think the effort is worth it, because the output is rather nice.
I must admit, though, here is one area where SAS blows R away. With the exception of integration into LaTeX, and some semisuccessful integration with Excel through the Statconn project, R’s reporting capabilities in terms of nicely formatted output is seriously outpaced by SAS ODS.

Saturday, August 6, 2011

Has statistics become a dirty word?

I reflect a little on the question over at the Statistics Forum.

Joint statistical meetings 2011 a reflection

I attended the Joint Statistical Meetings 2011 this year, and I think it marked a turning point for me as a professional. In years past, I diligently attended something every session for something interesting. This year I was determined to broaden my horizons, and ended up volunteering for best contributed paper coordinator of the biopharmaceutical section. I think I went to only three or four talks, but a learned a whole lot more from the one-on-one conversations and posters. I wrote up most of my thoughts over at the Statistics Forum, but I just want to reinforce the idea that most of my value from the conference came from the lunches and chatting that went on in between sessions. My only regret was not seeing the Google and Facebook talks.

And I managed to avoid a sunburn from the Miami sun this time.

Sunday, June 26, 2011

Watch this space: Data Without Borders

Data scientist Jake Porway is launching a new service called Data Without Borders, which seeks to match statisticians and nonprofits / NGOs:

If you are a data scientist / statistician who wants to do some good, check it out! I think he is also looking for a web designer, and I imagine there will need to be some coordination and management beyond Jake’s initial inspiration.

Sunday, March 27, 2011

More on Ultraedit to R

I think I have found a superior solution to integrating Ultraedit and R. See my post over at IDM Users forum.

Sunday, February 27, 2011

Who tweets about pharma?

I have been playing around with the Python language for a month or two, and really enjoy it. Specifically, I have been using Matthew Russell’s book Mining the Social Web (no financial relation here, just a happy customer) to find out how to do my own custom social network analysis. I’ve really only dipped my toes in it, but even the simple stuff can yield great insights. (Perhaps that’s the greatest insight I’m getting from this book!)

At any rate, I basically compiled all the material from Chapter 1 into one program, and searched for the term #pharma. I came up with the following:

Search term: #pharma
Total words: 7685
Total unique words: 3142
Lexical diversity: 0.408848
Avg words per tweet: 15.370000
The 50 most frequent words are
[u'#pharma', u'to', u'-', u'RT', u'#Pharma', u'in', u'for', u'the', u'and', u'of
', u':', u'a', u'on', u'&amp;', u'#Job', u'is', u'#jobs', u'Pharma', u'#health',
u'#healthinnovations', u'\u2013', u'The', u'at', u'#hcmktg', u'Pfizer', u'from'
, u'US', u'with', u'#biotech', u'#nhs', u'de', u'#drug', u'drug', u'
http://bit.ly/PHARMA', u'this', u'#advertising', u'...', u'FDA', u'#Roche', u'#healthjobs',
u'an', u'Medical', u'that', u'AdPharm', u'them', u'you', u'#medicament', u'Big',
u'ouch', u'post']
The 50 least frequent words are
[u'verteidigen', u'video', u'videos,', u'vids', u'viral', u'visiting', u'vom', u
'waiting', u'wave', u'way:', u'weakest', u'website.', u'week', u'weekend,', u'we
ekly', u'werden', u'whether', u'wholesaler', u'willingness', u'winning...', u'wi
sh', u'work?', u'workin...', u'working', u'works', u'works:', u'world', u'worldw
ide-', u'write', u'wundern', u'www.adecco.de/offic...', u'year', u'years,', u'ye
t?', u"you'll", u'you?', u'youtube', u'you\u2019re', u'zone', u'zu', u'zur', u'{
patients}!?', u'\x96', u'\xa360-80K', u'\xbb', u'\xc0', u'\xe9cart\xe9s', u'\xe9
t\xe9', u'\xe9t\xe9...', u'\xfcber']
Number of users involved in RT relationships for #pharma: 158
Number of RT relationships: 108
Average number of RTs per user: 1.462963
Number of subgraphs: 52

(The term u'<stuff>' above simply means that the words in quotes are stored in unicode, which you can safely ignore here.)

As for graphing the retweet relationships, I came up with the following:


So whom should you follow if you want pharma news? Probably you can start with those tweeters at the centers of wheels with a lot of spokes. What are #pharma tweeters concerned about these days? Jobs, jobs, jobs, and maybe some innovations.

Update: I checked the picture and saw that it was downsized into illegibility. So I uploaded an svg version.

Thursday, January 27, 2011

Coping with lower-truncated assay values in SAS and R


Many assays, for example ones designed to detect a certain antibody, cannot detect values below a certain level (called the “lower level of quantification,” or LLQ). Summarizing these is problematic, because if you summarize without these problematic values, you clearly estimate the mean too high and the standard deviation too low. However, if you substitute 0, you underestimate the mean and overestimate the standard deviation.

One solution is maximum likelihood. It is fairly straightforward to write down the loglikelihood function:

L(x|μ,σ,LLQ) = Πi=1nφ(xi;μ,σ)*Φ(LLQ;μ,σ)#(LLQ)

where φ() is the normal probability distribution function and Φ is the normal cumulative distribution function. What this likelihood essentially says is if the data point is real, use it (by contributing the value of the density function at that data value), but if it is below LLQ, we substitute in the probability of getting less than LLQ into the likelihood.

I looked around for a way to do this in SAS, and I found 3. The first is use SAS/IML with one of its optimization routines (such as nlpdd()). The second is to use SAS/OR with PROC NLP. The third way, and my preferred way at the moment, is to use SAS/STAT with PROC NLMIXED (SAS/STAT will be found at nearly every SAS installation, whereas IML and OR are not always).

To do this in SAS:

%let llq=1; * this is the lower detectable level ;

* read in the data in a useful way. I tried using the special missing value .L for LLQ but NLMIXED below didn’t like it. Substitute 0 with your value of choice if this doesn’t work;

proc FORMAT;
invalue llqi

DATA foo;
input dat :llqi.;

* define the loglikelihood and find its maximum ;

proc NLMIXED data=foo;
parms mu=1 sigma=1;
model dat ~ general(ll);

So the original purpose of NLMIXED is to fit nonlinear mixed models, but the optimization routines that power NLMIXED are very similar to the ones that power NLP. Second, the developers exposed those optimization routines to the user through the general() modifier in the model statement. So we define the loglikelihood data point by data point (so using φ(xi;μ,σ) if it is a normal data point or Φ(LLQ;μ,σ) if it is below LLQ) and then basically state that our data is distributed according to this loglikelihood.

To do this in R, the optimization is a little more direct:

dat <- c(1,NA,2,3,NA)

ll <- function(x,dat,llq) {

mu <- x[1]




So, we define the data, define the loglikelihood from more of a bird’s eye point of view (again, using the pdf at the real data values, and counting up the number of below LLQ and multiplying by the probability of below LLQ).

I checked, and they give the same answer. To get standard errors, compute the Hessian, but this should get you started. Both NLMIXED and optim should give those.

Another caveat: if your sample size is small and your distribution isn’t normal, you will have to work a bit harder and get the correct distribution to use in the likelihood.

Tuesday, January 25, 2011

Tables to graphs

Here is an interesting site on how to convert tables that make your eyes bleed to more pleasant and interpretable graphs. It seems to have more of a political science focus (and links extensively to books by Andrew Gelman), and I think the principles apply to most any area.

Sunday, January 23, 2011

Status of postmarketing commitments by product approval year

Because it didn't publish correctly, here's the key:

  • P-Pending
  • D-Delayed
  • F-Fulfilled
  • O-Ongoing
  • R-Released (i.e. FDA no longer requires it
  • S-Submitted
  • T-Terminated

Friday, January 21, 2011

List of statistics blogs

Here is a list of 40 statistics blogs. I read many of them regularly, but there are a few that are new to me.

Ultraedit to R

My favorite text editor on Windows is Ultraedit, but it does not have a nice interface to R in the same vein as Emacs/ESSTinn-R, or Eclipse. (I have never used Eclipse.) Ultraedit is powerful enough to submit whole R programs and even capture the output, but you cannot submit pieces and parts of programs.

Until now. After one night of dusting off my C skills and learning a bit about how to use winsock, I have created a little command line tool that will send code to R and get back the output. It's very rough around the edges, but it works.

You may also be interested in this forum thread.

Please read the README before using, as it will give you some minimal instructions on setup. Notably, the svSocket package is required (it's part of Sciviews, linked above, and available on CRAN).

If you use Ultraedit and R, I think you will find it helpful even with the rough edges. Feedback is appreciated, either here or the linked thread.

Saturday, January 15, 2011

Dona eis Python, whap!

Well, I'm taking the plunge and learning Python. We'll see how this goes. Then I'll try NumPy (and SciPy, if it gets ported), and see of I can get R and Python talking.

Saturday, January 1, 2011

Top Content of 2010

According to Google Analytics, my top content of 2010 was
O'Brien Fleming Designs In Practice
My O'Brien Fleming Design Is Not Same As Your O'Brien-Fleming Design
Matrix Square Roots In R
Barnard's Exact Test: Test That Ought To Be Used More
Perverse Incentives In Clinical Trials
Bayesian Information Criterion
Review Of Jim Albert's Bayesian Computing in R
NNT Returns
Note that views of the front page are excluded, but as that content rotates it is hard to interpret that. I have put some easy feedback mechanisms for each post (including the rating scale), but these have not returned very much information.
None of this surprises me. It seems like O'Brien Fleming designs are the top content just about any time I check, and I think this is because adaptive trials are becoming more popular. The FDA's draft guidance on the subject was released, and I guess O'Brien-Fleming designs are the most popular type of adaptive trial at least in the confirmatory stages of drug development.
I'm still surprised that matrix square roots in R is next, because the Denman-Beavers algorithm is fairly inefficient unless you need the square root and its inverse. Some method based on eigenvalues or singular value decomposition is probably better, but I guess occasionally you run across the odd matrix that has a rank 2 or higher subspace in its eigendecomposition. I'm glad to see Barnard's test getting a little exposure if at least to challenged the flawed overapplication of Fisher's exact test. I'm surprised Perverse Incentives and NNT did not rank higher, as those topics I think are of wider interest, but there you go. The BIC is a fairly specialized topic, and while model selection is of wide interest the methods I used that involved the BIC are rather specialized.
As a note, I write the review of Jim Albert's book before the second edition came out. I have the second edition now, and I should probably get around to updating the review.
As any reader of this blog might be able to tell, my focus is probably going to change a bit as I get more comfortable in the post-marketing world of drug development. The issues change, the quality and kinds of data change, and so my day-to-day issues will change. We'll see how it unfolds.
Of course, I'm open to suggestions as well.

Happy New Year 2011

Thank you Bill Watterson!
This year, 2011, is a prime number, and the first prime numbered-year since 2003. It is also the sum of a prime number (11) of consecutive primes. More numerology here.
Last year was a year of changes, some of which were not easy but all of which have produced good results 2011 will probably be more about strengthening current initiatives (including the blogging one!) than about starting new things, but you never know. The only thing I can imagine along those lines would be doing some freelance consulting here and there.
It's worth noting that my main goals are process-oriented and so really are more intentions than resolutions, and that strategy has worked fairly well. (The main one that's results-oriented is going on a big vacation for my 10th anniversary.)
At any rate, I'm an analyst in an increasingly analytical world, so look for more analysis!