Realizations in Biostatistics: 2013

Wednesday, August 7, 2013

Joint statistical meetings 2013

Every year, the first week of August, we statisticians meet to get our statistics, networking, dancing, and beer on. With thousands in attendance, it's exhausting. I wonder about the quality of statistical work the second week of August.

Each conference seems to have a life of its own, so I tend to reflect on each one. Here's my reflection on this year's:

First, being in Montreal, most of us couldn't use smartphones. Thankfully, Revolution Analytics sponsored free WiFi. They also do great work with R. So we were all for the most part able to tweet.

The quality of talks was pretty good this year, and I've learned a lot. We even had one person describe simulations with a flowchart rather than indecipherable equations, and I strongly encourage that practice.

As a member of the biopharmaceutical section, I was struck by how few people take advantage of our awards. Of course, everybody giving a contributed or topic contributed talks is automatically entered into the best contributed paper competition. But we have a poster competition and student paper competition that have to be explicitly entered, and participation is low. This is a great opportunity.

The highlight of the conference, of course, was Nate Silver's talk, and he delivered admirably. The perhaps thousand statisticians in attendance needed the message: learn to communicate with journalists and teach them numbers need context. I also like his response to the question "statistician or data scientist?" Which was, of course, "I don't care what you call yourself, just do good work."

Monday, July 15, 2013

Wasserman on noninformative priors

Larry Wasserman calls the use of noninformative priors a “lost cause.” I agree for the reasons he stated, and the fact that there are always better alternatives anyway. At the very least, there are the heavy-tailed “weakly informative priors” that put nearly all weight on something reasonable, such as small to moderate values of a variance, and little weight on stupid prior values, such as mean values on the order of 10¹⁰⁰.

However, they’ll be around for years to come. Noninformative priors are nice security blankets, and we get to think that we are approaching a problem with an open mind. I guess open minds can have stupid properties as well.

I hope, though, that we will start thinking more deeply about the consequences of our assumptions especially about noninformative priors rather than feeling nice about them.

Sunday, April 28, 2013

MOOCs–a low-risk way to explore outside your field

One of the things I'm realizing from Massively Open Online Courses (MOOCs) -- those online free classes from universities that have seem to sprung up from almost nowhere in the last year and a half -- is that they offer a perfect opportunity to explore outside my field. At first (and this was even before the term MOOC was coined), I took classes there were just outside my field. For instance, I've been in clinical and postmarketing pharmaceutical statistics for over 10 years, and my first two classes were in databases and machine learning. I did this because I was aching to learn something new, but I figured that with a class in databases I could make our database guys in IT sweat a bit just by dropping some terms and showing some understanding of the basics. It worked. In addition, I wanted to understand what this machine learning field was all about, and how it was different from statistics. I accomplished that goal, too.

Since then, I have taken courses in the area of artificial intelligence/machine learning, sociology and networks, scientific computing (separately from statistical computing), and even entrepreneurship. I have also encouraged others to take part in MOOCs, though I don't know the result of that. Finally, I have come back to some classes I've already taken as a community TA, or former student who actively takes part in discussions to help new students take the class.

This is all valuable experience, and I could write several blog entries on the benefits. The main one I'm feeling right now is the feeling that I'm coming up for air, and taking a sampling of other points of view in a low-risk way. For example, though I don't actively use Fourier analysis in my own work, one recent class and one current class both use it to do different things (solve differential equations and process signals). Because these classes involve programming assignments, I've now deepened my understanding of the spectral theorem, which I only studied from a theoretical point of view in graduate school. I'm also thinking about this work from the point of view of time series analysis, which is helping me think about some problems involving longitudinal data at work.

From a completely different standpoint, another class helped me think about salary negotiations in terms of expected payoff (i.e. combination of probability of an offer being accepted vs. salary). This sort of analysis invited further analysis of the value of that job vs. what I would be paid and the insecurity of moving to a different job. In the end, I turned down what would have been a pretty good offer, because I decided it did not compensate for the risks I was incurring. The cool thing is that these were all applying concepts I already understood (expected value, expected payoff), but applied in a different way from what I was already doing.

The best thing about MOOCs is that the risk is low. All that is required is an internet connection and a decent computer. Some math courses may require a better computer to do high-powered math, but I've seen few that require expensive textbooks or expensive software. Even Mathworks is now offering Matlab at student pricing to people who are taking some classes, and Octave remains a free option for people unable to take advantage of it. And, if you are unable to keep up the work, there is now downside. You can simply unenroll.

Monday, April 15, 2013

RStudio is reminding me of the older Macs

The only thing missing is the cryptic ID number.

Well, the only bad thing is that I am trying to run a probabilistic graphical model on some real data, and having a crash like this will definitely slow things down.

Saturday, March 30, 2013

Presenting without slides

Tired of slides, I’ve been experimenting with different ways of presenting. At the recent Conference on Statistical Practice, I decided only to use slides for an outline and references. As it turns out, the most critical feedback I got had to do with the fact that the audience couldn’t follow the organization because I had no slides.

I tried presenting without slides because, well, I started to use them as a crutch. I also saw a lot of people presenting essentially by putting together slides and reading from them. So I figured I would expand my horizons.

Next time I present, I’ll do slides, I guess, but I may try something a bit different.

Wednesday, March 27, 2013

Last session of Caltech's Learning from Data course starts April 2

I just received this email:

Caltech's Machine Learning MOOC is coming to an end this spring, with the final session starting on April 2. There will be no future sessions. The course has attracted more than 200,000 participants since its launch last year, and has gained wide acclaim. This is the last chance for anyone who wishes to take the course (http://work.caltech.edu/telecourse).
Best.
The Caltech Team

I strongly recommend this course if you can take it, even if you have taken other machine learning classes. It lays a great theoretical foundation for machine learning, sets it off nicely from classical statistics, and gives you some experience working with data as well.

If you were for some reason waiting for the right time, it looks to be now or never.

Wednesday, March 20, 2013

Review of Caltech's Learning from Data e-course

Caltech has an online course Learning from Data, taught by Professor Yaser Abu-Mostafa, that seeks to make the course material accessible to everybody. Unlike most of the online courses I've taken, this one is independently offered through a platform created just for the class. I took the course for its second offering in Jan-March 2013.

The platform on which the course is offered isn't as slick as Coursera. The lectures are offered through a Youtube playlist, and the homeworks are graded through multiple choice. That's perhaps a weakness of the class, but somehow the course faculty made it work.

The class's content was its strong point. Abu-Mostafa weaved theory and pragmatic concerns throughout the class, and invited students to write code in just about any platform (I, of course, chose R) to explore the theoretical ideas in a practical setting. Between this class and Andrew Ng's Machine Learning class on the Coursera platform, a student will have a very strong foundation to apply these techniques to a real-world setting.

I have only one objection to the content, which came in the last lecture. In his description of Bayesian techniques, he claimed that in most circumstances you could only model a parameter with a delta function. This, of course, falls in line with the frequentist notion that you have a constant, but unknowable "state of nature." I felt this way for a long time, but don't really believe it any more in a variety of contexts. I think he played up the Bayesian v. frequentist squabble a bit much, which may have been appropriate 20 years ago but is not so much an issue now.

Otherwise, I found the perspective from the course extremely valuable, especially in the context of supervised learning.

If you plan on taking the course, I recommend leaving a lot of time for it or having a very strong statistical background.

Tuesday, March 12, 2013

Distrust of R

I guess I've been living in a bubble for a bit, but apparently there are a lot of people who still mistrust R. I got asked this week why I used R (and, specifically, the package rpart) to generate classification and regression trees instead of SAS Enterprise Miner. Never mind the fact that rpart code has been around a very long time, and probably has been subject to more scrutiny than any other decision tree code. (And never mind the fact that I really don't like classification and regression trees in general because of their limitations.)

At any rate, if someone wants to pay the big bucks for me to use SAS Enterprise Miner just on their project, they can go right ahead. Otherwise, I have got a bit of convincing to do.

Thursday, February 28, 2013

Bad statistics in high impact journals

Better Journals… Worse Statistics? : Neuroskeptic

In the linked blog entry, Neuroskeptic notes that high impact journals often have fewer statistical details than other journals. The research reported in these journals is often heavily amended, if not outright contradicted, by later research. I don't think this is nefarious, though, nor is it worthless. The kind of work reported in Science and Nature, for instance, generates interest and, therefore, more scrutiny (funding, studies, theses, etc.).

But as with all other research, if statistical details are included it might direct subsequent research in these topics a bit better.

Wednesday, February 20, 2013

The burst of the Big Data bubble, and do we need the hype, anyway?

So, now I'm seeing some buzz over Twitter that the Big Data disillusionment is starting now. Frankly, I've been wondering when this would happen. Of course, the next stage involves making strategic investments in Big Data resources, and having these resources quietly being used effectively, at least if Big Data follows technologies such as neural networks, Java, etc. So the theory goes, all surviving technologies follow a pattern of hype, disillusionment, and then quiet acceptance.

Did we really need this period of hype? I can understand companies hype up a technology to maintain interest while they try to make their offerings mature, and overhyping usually leads to disillusionment, but I wonder if there is a different path. R, Python, and some other open projects seem to have flattened the hype hill and disillusionment valley, probably because the larger number of people hacking the inside generates its own interest and maturity mechanism.

Anyway, I look forward to the maturing of big data at least until the privacy concerns generate widespread panic.

Friday, February 15, 2013

Sloppy journalism with interactive graphics is still sloppy journalism

The Guardian recently discussed the "declining linguistic standards" in State of the Union addresses. I thought this was an interesting exercise, but something seemed wrong about the article, and it turns out this is one case where the data do not really speak for themselves. There's a lot of interpretation and understanding behind cultural trends in the use of the English language in America, as well as the evolution of the presidents' intentions behind the address. There are a few important points:

The author correctly points out that Woodrow Wilson essentially changed the format of the address through precedent from written document to speech. Right after Wilson's first speech there is a huge drop in the "education level" (hang on for a discussion of this terminology) of these addresses. As I recall, Wilson is the only American president with a Ph.D.
The index used - Flesch-Kincaid (FK), is questionable. Good on The Guardian to use a single measure for all speeches, but I have to wonder if it is wise to use the same measure for speeches and written addresses. Furthermore, FK is very sensitive to the placement of punctuation (it weights sentence length heavily). For instance, as a friend pointed out, one of Wilson's speeches has a FK grade level of over 17, but if you replace one of the semi-colons in the speech with a period, the FK grade drops to 12. This subtlety is lost in speech format, giving FK an extremely high uncertainty (this same friend calls FK "utterly useless" for speeches).
The audience of the SOTU address has changed. Though it's a constitutional duty of the president, the delivery as a speech is not, and it only has to be delivered to Congress. However, most modern addresses have been in the form of televised speeches, and have to be understood by a wider and less politically savvy audience.
Cultural trends in the use of spoken and written English in America involve shorter sentences over time in general.
In this case, a more sophisticated natural language processing analysis might reveal some interesting trends. For instance, how do wartime speeches compare to times of peace? Are there any natural categories of speeches that fall out? What are the outliers? How does this compare to polls?

In short, we have some interesting data that needs heavy qualification and critical analysis, that is just presented on a page and capped with a headline that gives an overly simplistic interpretation.

Monday, February 11, 2013

Operational details can be pesky

Recently, I was working with a team to finalize a clinical trial protocol. I raised some concerns about their strategic matters, and my concerns were dismissed as "operational details."

The thing about those pesky operational details is that, if something doesn't work due to an operational detail, you might have to modify your strategy. And if enough of these pesky operational details get in the way, you may have to rethink your strategy.