Sunday, November 11, 2012

Analysis of the statistics blogosphere

My analysis of the statistics blogosphere for the Coursera Social Network Analysis class is up. The Python code and the data are in my GitHub repository. Enjoy!

The repository includes most of the Python code I used to obtain blog content, some of my attempts to automate building the network (in the end I used a manual process), my analysis, and the data. (You can probably see some of your own content.)

Here's what I learned, or was reminded of, the most:

  • Doing projects like this is hard when you have other responsibilities, and you usually end up paring down your ambitions toward the end
  • Data collection and curation were, as usual, the most difficult part
  • Network analysis is fun, but I have a ways to go before I know where to start, what questions to ask, and so forth (these are the things you learn with experience)
  • The measures that seem to be the most revealing are not always obvious -- in this network, it was the number of shortest paths compared with that of a random graph
  • Andrew Gelman's blog is central (but you probably don't need a formal analysis to tell you that)
  • There's a lot of great content about statistics, data analysis, data science, and statistical computing out there. I've relied on blog posts for a lot of my work, and I've found even more great stuff. It's a firehose of information.
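
For readers curious about the shortest-path comparison mentioned in the list, here is a rough sketch of one way such a comparison could be set up. It is not the code from the repository. It assumes an undirected network, counts the distinct shortest paths over all connected pairs of nodes, and uses a random graph with the same number of nodes and edges as the baseline. The function names and the four-node `square` graph are made up for illustration.

```python
import random
from collections import deque
from itertools import combinations


def count_shortest_paths(adj, source):
    """BFS from source; map each reachable node to its number of shortest paths."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first time we reach v
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return sigma


def total_shortest_paths(adj):
    """Sum shortest-path counts over all unordered pairs of connected nodes."""
    total = 0
    for s in adj:
        sigma = count_shortest_paths(adj, s)
        total += sum(c for t, c in sigma.items() if t != s)
    return total // 2  # every pair was counted once from each end


def random_graph(nodes, n_edges, rng):
    """Undirected random graph on the same nodes with exactly n_edges edges."""
    adj = {v: set() for v in nodes}
    for u, v in rng.sample(list(combinations(nodes, 2)), n_edges):
        adj[u].add(v)
        adj[v].add(u)
    return adj


# Hypothetical toy network (a 4-cycle), not the blog data.
square = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
observed = total_shortest_paths(square)
baseline = random_graph(list(square), 4, random.Random(0))
print("observed:", observed, "random baseline:", total_shortest_paths(baseline))
```

In practice you would average the baseline over many random graphs rather than a single draw, and a real blog network is directed (links point one way), which this sketch ignores.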