Wednesday, December 31, 2014

Algorithmic cruelty

By now, most of us know about Facebook’s algorithmic retrospectives, and, of course, how some people found them cruel. Indeed, posts about painful events, such as divorces or deaths, can get a lot of “likes” (where the “like” button means something other than “like”) and comments, and can therefore get flagged by whatever algorithm Facebook’s data scientists came up with as posts worthy of a retrospective.

There are a lot of issues here. When someone “likes” a post, they do not necessarily mean they “like” the event the post is about. It could mean a number of different things, such as “I hear you,” or “I’m empathizing with you,” or even “Hang in there.” However, the algorithms treat all likes equally.

Comments, of course, carry much more sophisticated meaning, but they are much harder to analyze, especially in the presence of sarcasm. And algorithms that analyze comments (or any free text) for sentiment require a large training set of hand-coded comments. (Which I suppose Facebook has the resources to generate.)

Which leaves a few ways of handling this problem:

  • Do nothing different. Which is probably my favorite solution, because I’d like to look back on the good, the bad, and the ugly. It’s my life, and I want to remember it. Besides, the event that really sucked at the time (say, a torn ACL leading to surgery) may lead to good things.
  • Add an “I don’t want to see this” button. This is already accomplished by the existing “X” button, though maybe that’s not so obvious.
  • Eliminate the retrospective, which I don’t think anybody considers a good solution.

I suppose one day Facebook’s algorithm will be smart enough to withhold posts it knows people don’t want to review, but then that will open up another can of worms.

Tuesday, December 9, 2014

No, a study did not link genetically engineered crops to 22 diseases

In my Facebook feed, a friend posted a very scary-looking study that links genetically engineered (GE) crops to the rise in 22 diseases. These are pretty fearsome diseases, too, like bile duct cancer and pelvis cancer. Take, for instance, their Figure 16.1

There are a few ways to respond to this article:

First, it has not escaped my attention that the second author has published a book, Myths of Safe Pesticides, which has been analyzed and debunked by Harriet Hall.

Second, I could just say "correlation is not causation." QED. Article debunked, and can be swept to the dustbin.

Third, I can point out the correlation between sales of organic produce and autism. (Yikes!) In fact, using the methods of this article, I can probably prove a significant correlation between sales of organic produce and bile duct cancer, kidney cancer, autism, or lipoprotein disorder deaths. We can all grab our glyphosate-coated pitchforks and demand reform!

However, I think there are some statistical lessons here, and it's sometimes good to deconstruct misused and abused statistics. And trust me, the statistics in this article are seriously misused. In fact, it might be an interesting project for an introductory graduate statistics class to collect articles like this and critique them. I'll do it for fun here. Others can speak to the scientific aspects of the article, like how it disagrees with the results of a review covering over a trillion meals incorporating GE products. There are also other quibbles with the article, like how it sometimes conflates pesticide discussions with glyphosate (an herbicide), that others can deconstruct.

When deciding how to summarize and analyze data statistically, it is essential to work with the nature of the data. This article fails on several counts. First, it smashes together data from two completely different sources without considering how the data are related. Now, I'm generally excited to see data from disparate sources linked and analyzed together, but it has to be done carefully. This is how they obtained their data on GE use:

From 1990-2002, glyphosate data were available for all three crops, but beginning in 2003 data were not collected for all three crops in any given year. Data on the application rates were interpolated for the missing years by plotting and calculating a best fit curve. Results for the application rates for soy and corn are shown in Figures 2 and 3. Because the PAT was relatively small prior to about 1995, the sampling errors are much larger for pre-1995 data, more so for corn than for soy. Also, data were not missing until 2003 for soy and 2004 for corn. For these reasons, the interpolated curves begin in 1996 for soy and 1997 for corn in Figures 2 and 3.

This is how they obtained epidemiological data:

Databases were searched for epidemiological data on diseases that might have a correlation to glyphosate use and/or GE crop growth based on information given in the introduction. The primary source for these data was the Centers for Disease Control and Prevention (CDC). These data were plotted against the amount of glyphosate applied to corn and soy from Figure 6 and the total %GE corn and soy crops planted from Figure 1. The percentage of GE corn and soy planted is given by: (total estimated number of acres of GE soy + total estimated number of acres of GE corn)/(total Estimated acres of soy + total estimated acres of corn)x100, where the estimated numbers were obtained from the USDA as outlined above.

This seems innocent enough, but there's already a lot of wrong happening here. It's good that they explained some of their data cleaning, though we could always use more transparency behind this step. It's not scientifically glorious to describe how you handle missing or sparse data, but mishandling it can certainly sink your Nobel prize work. It's also good to explain derived variables, though I haven't gone back and checked their math.
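Checking the math on the derived %GE variable is at least mechanical. Here's a minimal sketch of the formula from the quoted methods; the acreage numbers are invented for illustration and are not the USDA figures the authors used:

```python
# Hypothetical acreages in millions of acres -- illustrative only,
# not the paper's actual USDA figures.
ge_soy, ge_corn = 60.0, 70.0        # estimated GE acres planted
total_soy, total_corn = 75.0, 90.0  # total estimated acres planted

# %GE = (GE soy acres + GE corn acres) / (total soy + total corn acres) x 100
pct_ge = (ge_soy + ge_corn) / (total_soy + total_corn) * 100
print(round(pct_ge, 2))  # 78.79
```

Note that this collapses two crops into a single national percentage, which already throws away any regional or crop-specific structure.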

The first fatal error is how they link the data: they simply merge it by year. It's the obvious-seeming step that already tanks their analysis. This is the same kind of merging that links, say, sales of organic crops to autism. Mashing up data needs to be done in a scientifically valid way, and simply merging disparate data by year isn't going to cut it here. All the data they gathered are crude summaries, and they strung them together by year without giving any thought to whether the subjects in the epidemiological database have any connection to the subjects in the GE database. Sloppy, and that alone can be enough to sink any analysis, even one that is otherwise well done. Which this one wasn't.
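To see how little the merge step actually buys you, here is a minimal sketch of joining two aggregate series by year, the same mechanical step that "links" organic-food sales to autism. All of the numbers are made up:

```python
# Two national-level aggregate series, keyed only by calendar year.
# All values are invented for illustration.
glyphosate_use = {2005: 60, 2006: 68, 2007: 75, 2008: 81}    # 1000 tons
disease_rate = {2005: 1.1, 2006: 1.3, 2007: 1.4, 2008: 1.6}  # deaths per 100k

# The "link" is nothing more than a shared calendar year; no subject in
# one series has any connection to a subject in the other.
years = sorted(glyphosate_use.keys() & disease_rate.keys())
merged = [(y, glyphosate_use[y], disease_rate[y]) for y in years]
print(merged[0])  # (2005, 60, 1.1)
```

Any two series that happen to trend over the same calendar years can be "linked" this way, which is exactly the problem.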

The second fatal error is how they present the data. Take their Figure 16. This graph breaks so many rules of data presentation that Edward Tufte's head would probably explode just from looking at it. But let's dig a little deeper. The authors say they plotted incidence of disease (in Figure 16, it's age-adjusted deaths due to lipoprotein disorder) against GE and glyphosate use. However, if you want to get technical about it, they plot all three of these versus time. This is a very important distinction. If they had plotted incidence versus GE use, they would have put GE use on the x-axis. Instead, they show incidence in a bar graph by time, GE use in a line graph by time, and glyphosate use by time. I'll explain why this is important in the discussion of the third fatal flaw.

But let's move ahead with the graph. From what I've been able to figure out, the left y-axis goes with the bar graph and is in deaths per hundred thousand. The axis on the right does double duty and covers both % of GE crops planted and 1000s of tons of glyphosate used. It took me a while to figure that out, and it's very sloppy design anyway (the two scales have nothing to do with each other). If you ever see a line plot with both a left and a right y-axis, get skeptical. Here, the left axis starts at 0 and ends at about 2.75, and the right axis starts at -20 (!) and ends at about 85. I can see why they chose the left axis, but the right axis is very curious. The -20 is a terrible choice for the start of the right axis: it's an invalid value both for % of GE crops planted and for 1000s of tons of glyphosate used. “Yes, Monsanto, I used -20,000 tons of glyphosate. You owe me $50,000.” It seems the origin and scale of the right y-axis were chosen specifically to make GE and glyphosate use appear to track closely with deaths. I usually choose incompetence over malice to explain motivations, but it's very challenging to support incompetence in this case. It takes talent and/or effort to choose axes like this.
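Making one series appear to track another is just a linear rescaling, which is exactly what hand-picking the limits of a second y-axis accomplishes. Here is a sketch with invented numbers:

```python
# Two unrelated, roughly increasing series. All values are invented.
deaths = [0.5, 0.9, 1.4, 2.0, 2.7]  # deaths per 100k (left axis)
usage = [10, 25, 45, 60, 80]        # 1000s of tons (right axis)

# Choose slope a and intercept b so that a*usage + b spans exactly the
# range of deaths -- equivalent to hand-picking the right-axis limits.
a = (max(deaths) - min(deaths)) / (max(usage) - min(usage))
b = min(deaths) - a * min(usage)
rescaled = [a * u + b for u in usage]

# The rescaled series now "tracks" deaths by construction.
print(round(rescaled[0], 6), round(rescaled[-1], 6))  # 0.5 2.7
```

Any two roughly monotone series can be overlaid this way, regardless of whether they have anything to do with each other.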
I'll leave a deconstruction of the other graphs as an exercise, perhaps for your graduate-level stats class.

The third and final fatal error is how they analyze the data. Their analysis is the statistical equivalent of bringing a knife to a gunfight. They basically take all the GE and epidemiological data, ignore the time component, and send it through your Stat 101 Pearson correlation estimator formula. They compute some p-values, unsurprisingly find a massively small p-value, declare victory, and hit the publish button. The problem is, they compute the wrong statistical summary using the wrong formula and use it to make the wrong inference. The Pearson correlation estimator they use is designed for independent data, not time series data (and they know it's time series data, because they say so on p. 11). Time series data have a complex correlation structure, so estimating second-order parameters like correlations is a bit of a challenge. For instance, GE use this year is going to be heavily correlated with GE use last year, as are deaths from lipoprotein disorders. Does the correlation reflect a relationship between death and GE use, or between death this year and death last year? The naïve estimate assumes the correlation is between death and GE use, and accounts for none of the relationship between deaths this year and last year (in the stat world, we call this autocorrelation). Though I haven't done the math, my guess is that the correlation between death and GE use would be greatly reduced, if not disappear altogether, if time were taken into account. And even if there is a nonzero, significant correlation, the fact of the matter is that there needs to be a stronger link than time between the GE data and the epidemiological data.
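A sketch of the autocorrelation problem, using entirely synthetic data: two series that merely share an upward trend correlate almost perfectly, and first-differencing (one crude way to account for time) makes the "relationship" collapse.

```python
import math

# Two unrelated series that both trend upward over "years" (synthetic).
n = 200
x = [t + math.sin(t) for t in range(n)]          # trend plus wiggle
y = [2 * t + math.cos(3 * t) for t in range(n)]  # different trend, different wiggle

def pearson(a, b):
    """Stat 101 Pearson correlation estimator (assumes independent data)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

naive = pearson(x, y)  # nearly 1: both series simply grow with time

# First differences remove the shared trend; what's left is unrelated.
dx = [x[t + 1] - x[t] for t in range(n - 1)]
dy = [y[t + 1] - y[t] for t in range(n - 1)]
detrended = pearson(dx, dy)

print(naive > 0.99, abs(detrended) < 0.2)  # True True
```

Differencing is only a first step; a real time series analysis would model the autocorrelation structure explicitly, but even this crude adjustment shows how fragile the naïve correlation is.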

As a bonus, the paper claims to find a link between GE crop use, glyphosate use, and a whole bunch of nasty stuff, but they never try to tease out whether the nasty stuff is attributable to glyphosate or GE crops.

In conclusion, the paper claims to find a strong link between GE crop use and glyphosate use, and a host of diseases. Given that the paper is so deeply methodologically flawed, the authors are unable to support their conclusions. This paper should not be considered evidence of the dangers of GE crop use or glyphosate use, but should rather be used as a showcase of "How Not to Do It."

Edit: I need to learn how to spell glyphosate.

1Swanson, Leu, Abrahamson, and Wallet. "Genetically engineered crops, glyphosate and the deterioration of health in the United States." Journal of Organic Systems. 9(2), 2014. Figure 16.

Wednesday, August 7, 2013

Joint statistical meetings 2013

Every year, the first week of August, we statisticians meet to get our statistics, networking, dancing, and beer on. With thousands in attendance, it's exhausting. I wonder about the quality of statistical work the second week of August.

Each conference seems to have a life of its own, so I tend to reflect on each one. Here's my reflection on this year's:

First, being in Montreal, most of us couldn't use our smartphones. Thankfully, Revolution Analytics sponsored free WiFi. (They also do great work with R.) So we were, for the most part, all able to tweet.

The quality of talks was pretty good this year, and I learned a lot. We even had one person describe simulations with a flowchart rather than indecipherable equations, a practice I strongly encourage.

As a member of the biopharmaceutical section, I was struck by how few people take advantage of our awards. Of course, everybody giving a contributed or topic-contributed talk is automatically entered into the best contributed paper competition. But we have a poster competition and a student paper competition that have to be explicitly entered, and participation is low. This is a great opportunity.

The highlight of the conference, of course, was Nate Silver's talk, and he delivered admirably. The perhaps thousand statisticians in attendance needed the message: learn to communicate with journalists and teach them that numbers need context. I also liked his response to the question "statistician or data scientist?" Which was, of course, "I don't care what you call yourself, just do good work."

Monday, July 15, 2013

Wasserman on noninformative priors

Larry Wasserman calls the use of noninformative priors a “lost cause.” I agree, both for the reasons he states and because there are always better alternatives anyway. At the very least, there are heavy-tailed “weakly informative priors” that put nearly all their weight on something reasonable, such as small to moderate values of a variance, and little weight on stupid prior values, such as means on the order of 10^100.
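As a rough illustration (the scales here are my own invented choices), compare how much prior mass a flat “noninformative” prior and a weakly informative half-Cauchy put on absurd values of a standard deviation:

```python
import math

# Suppose a standard deviation we believe is plausibly under ~10, and
# call anything above 100 "stupid". All scales below are illustrative.

# "Noninformative" flat prior on (0, 1000): mass above 100.
flat_mass = (1000 - 100) / 1000

# Weakly informative half-Cauchy(scale): P(X > x) = 1 - (2/pi) * atan(x/scale).
def half_cauchy_tail(x, scale):
    return 1 - (2 / math.pi) * math.atan(x / scale)

hc_mass = half_cauchy_tail(100, scale=5)

# The flat prior puts 90% of its mass on stupid values; the half-Cauchy
# keeps a heavy tail but puts only ~3% there.
print(flat_mass, round(hc_mass, 3))  # 0.9 0.032
```

The half-Cauchy still lets the data pull the parameter to large values if warranted; it just doesn't bet the farm on them up front.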

However, they’ll be around for years to come. Noninformative priors are nice security blankets, and we get to think that we are approaching a problem with an open mind. I guess open minds can have stupid properties as well.

I hope, though, that we will start thinking more deeply about the consequences of our assumptions, especially about noninformative priors, rather than just feeling nice about them.

Sunday, April 28, 2013

MOOCs–a low-risk way to explore outside your field

One of the things I'm realizing from Massively Open Online Courses (MOOCs) -- those free online classes from universities that seem to have sprung up from almost nowhere in the last year and a half -- is that they offer a perfect opportunity to explore outside my field. At first (and this was even before the term MOOC was coined), I took classes that were just outside my field. For instance, I've been in clinical and postmarketing pharmaceutical statistics for over 10 years, and my first two classes were in databases and machine learning. I did this because I was aching to learn something new, but I figured that with a class in databases I could make our database guys in IT sweat a bit just by dropping some terms and showing some understanding of the basics. It worked. In addition, I wanted to understand what this machine learning field was all about, and how it was different from statistics. I accomplished that goal, too.

Since then, I have taken courses in the areas of artificial intelligence/machine learning, sociology and networks, scientific computing (separately from statistical computing), and even entrepreneurship. I have also encouraged others to take part in MOOCs, though I don't know the result of that. Finally, I have come back to some classes I've already taken as a community TA, i.e., a former student who actively takes part in discussions to help new students through the class.

This is all valuable experience, and I could write several blog entries on the benefits. The main one I'm feeling right now is the feeling that I'm coming up for air, and taking a sampling of other points of view in a low-risk way. For example, though I don't actively use Fourier analysis in my own work, one recent class and one current class both use it to do different things (solve differential equations and process signals). Because these classes involve programming assignments, I've now deepened my understanding of the spectral theorem, which I only studied from a theoretical point of view in graduate school. I'm also thinking about this work from the point of view of time series analysis, which is helping me think about some problems involving longitudinal data at work.

From a completely different standpoint, another class helped me think about salary negotiations in terms of expected payoff (i.e., the probability of an offer being accepted combined with the salary offered). This invited further analysis of the value of that job versus what I would be paid, and of the insecurity of moving to a different job. In the end, I turned down what would have been a pretty good offer, because I decided it did not compensate for the risks I would be incurring. The cool thing is that this was all applying concepts I already understood (expected value, expected payoff) in a different way from what I was already doing.
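The expected-payoff framing above can be sketched in a few lines; the salaries and acceptance probabilities are made up for illustration:

```python
# Sketch of the expected-payoff framing for a salary negotiation.
# Salaries and acceptance probabilities are made up for illustration.
asks = [
    # (salary asked, probability the offer goes through at that ask)
    (100_000, 0.80),
    (115_000, 0.75),
    (130_000, 0.50),
]

# Expected payoff of each ask: salary times probability of acceptance.
expected = [(salary, salary * p) for salary, p in asks]
best = max(expected, key=lambda t: t[1])
print(best)  # (115000, 86250.0)
```

A fuller analysis would subtract the value of the current job and a risk premium from each payoff, but the mechanics are the same.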

The best thing about MOOCs is that the risk is low. All that is required is an internet connection and a decent computer. Some math courses may require a better computer to do high-powered math, but I've seen few that require expensive textbooks or expensive software. Even Mathworks is now offering Matlab at student pricing to people taking some classes, and Octave remains a free option for those unable to take advantage of that. And if you are unable to keep up with the work, there is no downside: you can simply unenroll.

Monday, April 15, 2013

RStudio is reminding me of the older Macs

The only thing missing is the cryptic ID number.

Well, the only bad thing is that I am trying to run a probabilistic graphical model on some real data, and having a crash like this will definitely slow things down.

Saturday, March 30, 2013

Presenting without slides

Tired of slides, I’ve been experimenting with different ways of presenting. At the recent Conference on Statistical Practice, I decided to use slides only for an outline and references. As it turns out, the most critical feedback I got was that the audience couldn’t follow the organization of my talk because I had no slides.

I tried presenting without slides because, well, I started to use them as a crutch. I also saw a lot of people presenting essentially by putting together slides and reading from them. So I figured I would expand my horizons.

Next time I present, I’ll do slides, I guess, but I may try something a bit different.

Wednesday, March 27, 2013

Last session of Caltech's Learning from Data course starts April 2

I just received this email:

Caltech's Machine Learning MOOC is coming to an end this spring, with the final session starting on April 2. There will be no future sessions. The course has attracted more than 200,000 participants since its launch last year, and has gained wide acclaim. This is the last chance for anyone who wishes to take the course (
The Caltech Team
I strongly recommend this course if you can take it, even if you have taken other machine learning classes. It lays a great theoretical foundation for machine learning, sets it off nicely from classical statistics, and gives you some experience working with data as well.

If you were for some reason waiting for the right time, it looks to be now or never.

Wednesday, March 20, 2013

Review of Caltech's Learning from Data e-course

Caltech has an online course Learning from Data, taught by Professor Yaser Abu-Mostafa, that seeks to make the course material accessible to everybody. Unlike most of the online courses I've taken, this one is independently offered through a platform created just for the class. I took the course for its second offering in Jan-March 2013.

The platform on which the course is offered isn't as slick as Coursera's. The lectures are offered through a YouTube playlist, and the homework is graded through multiple choice. That's perhaps a weakness of the class, but somehow the course faculty made it work.

The class's content was its strong point. Abu-Mostafa wove theory and pragmatic concerns throughout the class, and invited students to write code on just about any platform (I, of course, chose R) to explore the theoretical ideas in a practical setting. Between this class and Andrew Ng's Machine Learning class on the Coursera platform, a student will have a very strong foundation for applying these techniques in a real-world setting.

I have only one objection to the content, which came in the last lecture. In his description of Bayesian techniques, he claimed that in most circumstances you could only model a parameter with a delta function. This, of course, falls in line with the frequentist notion that you have a constant, but unknowable "state of nature." I felt this way for a long time, but don't really believe it any more in a variety of contexts. I think he played up the Bayesian v. frequentist squabble a bit much, which may have been appropriate 20 years ago but is not so much an issue now.

Otherwise, I found the perspective from the course extremely valuable, especially in the context of supervised learning.

If you plan on taking the course, I recommend leaving a lot of time for it or having a very strong statistical background.

Tuesday, March 12, 2013

Distrust of R

I guess I've been living in a bubble for a bit, but apparently there are a lot of people who still mistrust R. I got asked this week why I used R (and, specifically, the package rpart) to generate classification and regression trees instead of SAS Enterprise Miner. Never mind the fact that rpart code has been around a very long time, and probably has been subject to more scrutiny than any other decision tree code. (And never mind the fact that I really don't like classification and regression trees in general because of their limitations.)

At any rate, if someone wants to pay the big bucks for me to use SAS Enterprise Miner just on their project, they can go right ahead. Otherwise, I have got a bit of convincing to do.