Wednesday, November 14, 2012

Rare things happen all the time

John Cook reports on the probability of long runs. This is a very useful reality check.

I think there is a larger principle here, though, that rare things happen all the time.
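As a rough illustration of the reality check, here is a minimal sketch that computes the chance of seeing at least one run of k consecutive heads somewhere in n coin flips. The function name `prob_run_at_least`, the restriction to heads-only runs, and the fair-coin default are assumptions for illustration and are not taken from John Cook's post.

```python
def prob_run_at_least(n, k, p=0.5):
    """Probability of at least one run of k consecutive heads in n flips.

    p is the probability of heads on each independent flip (assumed fair by default).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # state[j] = probability that no run of length k has occurred so far
    # and the sequence currently ends in exactly j consecutive heads.
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new_state = [0.0] * k
        for j, prob in enumerate(state):
            new_state[0] += prob * (1 - p)   # tails resets the current run
            if j + 1 < k:
                new_state[j + 1] += prob * p  # heads extends the current run
            # otherwise the run reaches length k and leaves the "no run yet" states
        state = new_state
    return 1.0 - sum(state)


if __name__ == "__main__":
    # How likely is a run of 7 heads as the number of flips grows?
    for flips in (10, 100, 1000):
        print(flips, prob_run_at_least(flips, 7))
```

Any single window of 7 flips is unlikely to be all heads, but the probability climbs as the number of flips grows, which is the sense in which rare things happen all the time.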

Sunday, November 11, 2012

Analysis of the statistics blogosphere

My analysis of the statistics blogosphere for the Coursera Social Network Analysis class is up. The Python code and the data are in my GitHub repository. Enjoy!

Included are most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process), and my analysis. I also included the data. (You can probably see some of your own content.)

Here's what I learned/got reminded of the most:

  • Doing projects like this is hard when you have other responsibilities, and you usually end up paring down your ambitions toward the end
  • Data collection and curation were, as usual, the most difficult part
  • Network analysis is fun, but I have a ways to go to know where to start first, what questions to ask, and so forth (these are the things you learn with experience)
  • The measures that seem to be the most revealing are not always obvious -- in this network, it was the number of shortest paths compared to a random graph
  • Andrew Gelman's blog is central (but you probably don't need a formal analysis to tell you that)
  • There's a lot of great content about statistics, data analysis, data science, and statistical computing out there. I've relied on blog posts for a lot of my work, and I've found even more great stuff. It's a firehose of information.
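The comparison against a random graph mentioned in the list above could be sketched as follows. This is one possible reading of "number of shortest paths", not the analysis in the repository: it counts distinct shortest paths between every ordered pair of nodes, treats links as undirected, and uses a random graph with the same number of nodes and edges as the baseline. The toy network and all function names are hypothetical.

```python
import itertools
import random
from collections import deque


def count_shortest_paths(adj, source):
    """BFS from source; return {node: number of distinct shortest paths from source}."""
    dist = {source: 0}
    n_paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:           # first time we reach v
                dist[v] = dist[u] + 1
                n_paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                n_paths[v] += n_paths[u]
    return n_paths


def total_shortest_paths(adj):
    """Sum shortest-path counts over all ordered pairs of distinct, connected nodes."""
    total = 0
    for s in adj:
        counts = count_shortest_paths(adj, s)
        total += sum(c for node, c in counts.items() if node != s)
    return total


def random_graph(n, m, seed=None):
    """Undirected graph on nodes 0..n-1 with m edges chosen uniformly at random."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for u, v in rng.sample(list(itertools.combinations(range(n), 2)), m):
        adj[u].add(v)
        adj[v].add(u)
    return adj


if __name__ == "__main__":
    # Toy "observed" network (hypothetical): a 4-cycle 0-1-2-3-0.
    observed = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
    n_edges = sum(len(nbrs) for nbrs in observed.values()) // 2
    baseline = [
        total_shortest_paths(random_graph(len(observed), n_edges, seed=s))
        for s in range(200)
    ]
    print("observed:", total_shortest_paths(observed))
    print("random baseline (mean):", sum(baseline) / len(baseline))
```

A large gap between the observed count and the random baseline suggests the network's structure is doing something a random wiring of the same size would not.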

Monday, November 5, 2012

Snapshot of the statistics blogosphere

stats_blogs

This was generated during my social network analysis project. I haven’t finished yet, but I did want to show the cute picture. The statistics blogosphere is like a school of jellyfish.

Sunday, November 4, 2012

Sometimes, saving CPU time is worth it for small data jobs

There appears to be a conventional wisdom, one that I myself have espoused on several occasions, that for “most” statistical computing jobs, developer time is more precious than CPU time. (The reason I write “most” in quotes is that there are some people who work in environments where Big Data or large jobs are the norm, or who are developing high performance computing libraries, and they have to squeeze every last bit of performance out of the CPU.)

However, sometimes it can be worth it to save a few extra minutes on small jobs, especially if they are run over and over. At one point today, I had an algorithm that I wrote inefficiently using Python’s built-in lists. I decided to stop the job and rewrite it using the NumPy library, which took me an extra half hour. At first, I thought the time was wasted, but I have ended up running the code several times for various reasons. Those saved minutes have now, a couple of hours later, saved me more time than I spent rewriting.
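The trade-off is simple arithmetic, sketched below. The half-hour rewrite cost comes from the story above; the 4 minutes saved per run is a made-up figure for illustration, since the actual per-run savings are not stated, and the function names are hypothetical.

```python
import math


def break_even_runs(rewrite_minutes, saved_minutes_per_run):
    """Smallest number of runs whose total savings cover the rewrite cost."""
    if saved_minutes_per_run <= 0:
        raise ValueError("the rewrite must save some time per run")
    return math.ceil(rewrite_minutes / saved_minutes_per_run)


def net_minutes_saved(rewrite_minutes, saved_minutes_per_run, runs):
    """Total time saved after a given number of runs, minus the rewrite cost."""
    return runs * saved_minutes_per_run - rewrite_minutes


if __name__ == "__main__":
    # 30-minute rewrite (from the post), 4 minutes saved per run (assumed).
    print(break_even_runs(30, 4))        # runs needed to pay back the rewrite
    print(net_minutes_saved(30, 4, 10))  # net minutes gained after 10 runs
```

The point is that the number of reruns, which is easy to underestimate up front, matters as much as the per-run savings.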

Friday, November 2, 2012

Politics vs. science and the Nate Silver controversy

I’ll take a small departure from the narrow world of biostatistics and comment on a wider matter.

Nate Silver of FiveThirtyEight has really kicked the hornet’s nest. This is a nest that really needed stirring, but I do not envy him for being the focus of attention.

This all started, I think, when he released his book and basically called political pundits out for a business model of generating drama rather than making good predictions. This wouldn’t be a huge deal, except that he has developed a statistical model that combines data from national and state polls with demographic data to project outcomes of presidential and senatorial elections. This model, as of this writing, has President Obama at close to an 81% probability of re-election, given the current state of things. As it turns out, there are a lot of people who don’t like this, and they generally fall into two camps:

1. People who would rather see President Obama defeated in the election, and

2. Pundits who have a vested interest in a dramatic “horse-race” election

I’ll add a third:

3. Pundits who want to remain relevant (whether to keep their jobs or reputations).

Frankly, I don’t think that pundits will have to worry about #3. There’s an allergy to fact in this country, a large group of people who would rather ignore established fact and cling to a fantasy. (You can find a sampling of these people over at the intelligent design blogosphere, for instance.) I think the demand for compelling stories over dry facts will remain.

I’ve run into people of the first type when I’ve published some armchair statistician analyses based on Twitter sentiment, for instance. The responses weren’t critiques of the method, but rather, “who cares, Republicans rule!” Even more dangerous, I’ve run into similar responses to negative clinical study results in cases where sponsors have a vested interest in positive outcomes. (I remember at least one case where a sponsor moved forward with an expensive follow-on study, and some where I was asked to reanalyze a zillion times.)

Nate wrote The Signal and the Noise, where he, among a lot of explanation, points out that there is a whole cottage industry of people getting paid to BS about politics. So I think that some in the second category are starting to face an existential crisis, and that makes them dangerous.

Ultimately, we have to understand where Nate is coming from to understand his prediction. His money is literally on Obama’s victory in the election (he made a bet[1] on Twitter with “Morning Joe” Scarborough of NBC), not necessarily because he wants Obama to win, but because he has confidence in his prediction. When he made the bet, he turned the controversy into more than just trading words; he called Joe’s bluff (Joe had said that anyone not calling the race a tossup is an ideologue). We can now call him The Statistician Who Kicked the Hornet’s Nest: the punditry, including the public editor of the New York Times, which hosts his blog, is collectively attacking him.

Unfortunately, the punditry has the upper hand, because people are more interested in the narrative than the science.

[1] The bet originally consisted of the loser donating $1000 to charity. Nate subsequently donated $2538 to the Red Cross before the election.