Sunday, December 9, 2012

MOOCs have exploded!

About a year and two months ago, Stanford University taught three classes online: Intro to Databases, Machine Learning, and Artificial Intelligence. I took two of those classes (I did not feel I had time to take Artificial Intelligence), and found them very valuable. The success of those programs led to the development of at least two companies in a new area of online education: Coursera and Udacity. In the meantime, other efforts have been started (I’m thinking mainly edX, but there are others as well), and now many universities are scrambling to take advantage of either the framework of these companies or other platforms.

Put simply, if you have not already, then you need to make the time to do some of these classes. Education is the most important investment you can make in yourself, and at this point there are hundreds of free online university-level classes in everything from the arts to statistics. If ever you wanted to expand your horizons, now’s the time.

I’ve personally taken 7 online classes now, and earned certificates in all of them. I use the material from many of these classes in my work, and I have even used two (Machine Learning and Probabilistic Graphical Models) to expand my company’s capabilities. I am far more secure in my job because of what I’ve learned. In addition, I had the honor of trying out the Probabilistic Graphical Models Community TA program, and my only regret is that I couldn’t put more time into it. To the extent that I took advantage of it, I got a lot out of the experience.

Now, the hard part. These classes require self-discipline. As at universities, there are some duds as well. At least you can add and drop at will, without worrying about prerequisites. You have to take responsibility for your own education and your own motivation.

In all, I’m very grateful to the pioneers (Andrew Ng, Daphne Koller, Sebastian Thrun, and others) who saw this need and had the knowledge and motivation to fill it. They are now moving in the direction of accreditation, with both free and premium models (probably for some kind of licensing or degree, which I don’t care about right now). For now, you can sign up and take classes at will.

Happy MOOCing!

Wednesday, November 14, 2012

Rare things happen all the time

John Cook reports on the probability of long runs. This is a very useful reality check.

I think there is a larger principle here, though, that rare things happen all the time.
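To make the reality check concrete, here is a minimal sketch of how the chance of a long run can be computed exactly. It is my own illustration, not the calculation from John Cook's post; the function name `prob_run_at_least` and the assumption of independent trials with a fixed success probability are mine.

```python
def prob_run_at_least(n, k, p=0.5):
    """Probability of at least one run of k or more successes in n independent trials."""
    # state[j] = probability that no run of length k has occurred so far
    # and the current trailing run of successes has length j (j = 0..k-1).
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        for j, prob in enumerate(state):
            if prob == 0.0:
                continue
            new[0] += prob * (1 - p)      # a failure resets the trailing run
            if j + 1 < k:
                new[j + 1] += prob * p    # a success extends the trailing run
            # otherwise the run reaches length k and leaves the "no run yet" set
        state = new
    return 1.0 - sum(state)


# Example: chance of seeing 7 or more heads in a row somewhere in 100 fair flips
print(prob_run_at_least(100, 7))
```

Trying a few values of n and k shows how quickly a "rare" streak becomes likely once there are enough chances for it to happen.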

Sunday, November 11, 2012

Analysis of the statistics blogosphere

My analysis of the statistics blogosphere for the Coursera Social Networking Analysis class is up. The Python code and the data are up at my github repository. Enjoy!

Included is most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process), and my analysis. I also included the data. (You can probably see some of your own content.)

Here's what I learned/got reminded of the most:

  • Doing projects like this is hard when you have other responsibilities, and you usually end up paring down your ambitions toward the end
  • Data collection and curation was, as usual, the most difficult process
  • Network analysis is fun, but I have a ways to go to know where to start first, what questions to ask, and so forth (these are the things you learn with experience)
  • The measures that seem to be the most revealing are not always obvious -- in this network, it was the number of shortest paths compared to a random graph
  • Andrew Gelman's blog is central (but you probably don't need a formal analysis to tell you that)
  • There's a lot of great content about statistics, data analysis, data science, and statistical computing out there. I've relied on blog posts for a lot of my work, and I've found even more great stuff. It's a firehose of information.

Monday, November 5, 2012

Snapshot of the statistics blogosphere


This was generated during my social network analysis project. I haven’t finished yet, but I did want to show the cute picture. The statistics blogosphere is like a school of jellyfish.

Sunday, November 4, 2012

Sometimes, saving CPU time is worth it for small data jobs

There appears to be a conventional wisdom, one that I myself have espoused on several occasions, that for “most” statistical computing jobs developer time is more precious than CPU time. (The reason I write “most” in quotes is that there are some people who work in environments where Big Data or large jobs are the norm, or they are developing high performance computing libraries, and they have to squeeze every last bit of performance out of the CPU.)

However, sometimes it can be worth it to save a few extra minutes on small jobs, especially if they are run over and over. At one point today, I had an algorithm that I wrote inefficiently using Python’s built-in lists. I decided to stop the job and rewrite it using the NumPy libraries, which took me an extra half hour. At first, I thought the time was wasted, but I have ended up running the code several times for various reasons. Those saved minutes have now, a couple of hours later, saved me more time than I spent rewriting.
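The tradeoff above is simple break-even arithmetic. The sketch below uses a hypothetical helper, `runs_to_break_even`, and a made-up figure of 8 minutes saved per run; I did not record the actual per-run savings.

```python
import math


def runs_to_break_even(rewrite_minutes, minutes_saved_per_run):
    """Smallest number of runs after which the rewrite has paid for itself."""
    if minutes_saved_per_run <= 0:
        raise ValueError("the rewrite must save time on each run")
    return math.ceil(rewrite_minutes / minutes_saved_per_run)


# Hypothetical numbers: a 30-minute rewrite that saves 8 minutes per run
print(runs_to_break_even(30, 8))
```

The point is only that the break-even count can be small when a job is rerun often, so the rewrite is worth considering even for small data.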

Friday, November 2, 2012

Politics vs. science and the Nate Silver controversy

I’ll take a small departure from the narrow world of biostatistics and comment on a wider matter.

Nate Silver of FiveThirtyEight has really kicked the hornet’s nest. This is a nest that really needed stirring, but I do not envy him for being the focus of attention.

This all started, I think, when he released his book and basically called political pundits out for a business model of generating drama rather than making good predictions. This wouldn’t be a huge deal, except that he has developed a statistical model that combines data from national and state polls with demographic data to project outcomes of presidential and senatorial elections. This model, as of this writing, has President Obama at close to an 81% probability of re-election, given the current state of things. As it turns out, there are a lot of people that don’t like this, and they generally fall into two camps:

1. People who would rather see President Obama defeated in the election, and

2. Pundits who have a vested interest in a dramatic “horse-race” election

I’ll add a third:

3. Pundits who want to remain relevant (whether to keep their jobs or reputations).

Frankly, I don’t think that pundits will have to worry about #3. There’s an allergy to fact in this country, a large group of people who would rather ignore established fact and cling to a fantasy. (You can find a sampling of these people over at the intelligent design blogosphere, for instance.) I think the demand for compelling stories over dry facts will remain.

I’ve run into people of the first type when I’ve published some armchair statistician analyses based on Twitter sentiment, for instance. The responses weren’t critiques of the method, but rather, “who cares, Republicans rule!” Even more dangerous, I’ve run into similar responses to negative clinical study results in cases where sponsors have a vested interest in positive outcomes. (There was at least one case I remember where a sponsor moved forward with an expensive follow-on study, and some where I was asked to reanalyze a zillion times.)

Nate wrote The Signal and the Noise, where he, among a lot of explanation, points out that there is a whole cottage industry of people getting paid to BS about politics. So I think that some in the second category are starting to face an existential crisis, and that makes them dangerous.

Ultimately, we have to understand where Nate is coming from to understand his prediction. His money is literally on Obama’s victory in the election (he made a bet[1] on Twitter with “Morning Joe” Scarborough of NBC), not necessarily because he wants Obama to win, but because he has confidence in his prediction. By making the bet, he turned the controversy into more than just trading words and called Joe’s bluff (Joe had said that anyone not calling the race a tossup is an ideologue). We can now call him The Statistician Who Kicked the Hornet’s Nest: the punditry, including the public editor of the New York Times, which hosts his blog, is collectively attacking him.

Unfortunately, the punditry has the upper hand, because people are more interested in the narrative than the science.

[1] The bet originally consisted of the loser donating $1000 to charity. Nate subsequently donated $2538 to the Red Cross before the election.

Wednesday, October 31, 2012

Willful statistical illiteracy

The fine folks over at Simply Statistics have a very good educational article about the difference between the probability of winning an election and vote share. This article stems from a controversial column over at Politico criticizing Nate Silver and his election forecasts.

Twitter responses are even worse. Conservative filmmaker John Ziegler calls Nate Silver a “hyper-partisan fraud” who is “not an expert on polls.”


Glenn Thrush mentions a “conservative 538:”


And it’s not hard to find other examples.

I’ve run into this reaction a bit, especially when it comes to politics. There is a large group of people who will dismiss any evidence going against their beliefs. I guess the punditry wasn’t so dismissive of Silver in 2010.

At any rate, I give a recommendation I rarely give: read this Politico article and the comments (ignore the “conservatives aren’t bright” nonsense, which is the same stuff coming from the left).

And let’s thank Nate Silver, RealClearPolitics, and all the honest pollsters who try to shine some data on this election.
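The difference between win probability and vote share mentioned at the top of this post can be sketched in a few lines. This is my own simplified illustration, not Nate Silver's model and not the Simply Statistics example: it assumes a two-candidate popular-vote race and normally distributed forecast error, and the function name `win_probability` and the numbers are hypothetical.

```python
import math


def win_probability(expected_share, share_sd):
    """P(actual vote share > 50%) under a normal model of forecast error."""
    z = (expected_share - 0.5) / share_sd
    # Standard normal CDF evaluated at z, written with the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical forecast: 51% expected share with a 1-point standard deviation
print(win_probability(0.51, 0.01))
```

Under these assumptions a one-point lead in expected share already translates to a win probability well above one half, which is why a high win probability does not imply a landslide.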

Monday, October 29, 2012

The most valuable thing about my little stat blog network project

So, I decided to construct the linking graph through blogrolls, and finally settled on using a manual process. The best part of this project is really finding out for myself all the great content out there!

Monday, October 22, 2012

SNA class proposal

I’ve been taking several classes through Coursera (nothing against the other platforms; I took two of the original three classes via Stanford and just stuck with the platform). The latest one is Social Network Analysis, which has a programming project. Here is what I have posted as a proposal:

Ok, I've been thinking about the programming project idea some, and at first I was thinking of analyzing the statistics blogging community, mostly because I belong to it and I wanted to see what comes out. The analysis below can be done for any sort of community. I've developed this idea a little further and wanted to record it here for two reasons. First, I simply need to write it down to get it out of my head and in such a way that the public can understand it. Second, I'd like feedback.

As it turns out, I took the NLP class in the spring and think there's some overlap that can be exploited. (This comes up nicely in the Mining the Social Web and Programming Collective Intelligence books.) There are measures of content similarity, such as cosine similarity, which are simple to compute and work reasonably well to see how similar content is. Content can then be clustered based on similarity. So, then, I have the following questions:

  • What are the communities, and do they relate to clusters of content similarity?
  • If so, who are the "brokers" between different communities, and what do they blog about? There are a couple of aggregators, such as StatBlogs and R-Bloggers, that I imagine would glue together several communities (that's their purpose and value), but I imagine there are a few others that are aggregator-like + commentary as well. Original content generators, like mine, will probably be on the edges.
  • Is it better to threshold edges based on a number of mentions, or use an edge weight based on the number of mentions?
  • If I have time, I may try to do some sort of topic or named entity extraction, and get an automated way of seeing what these different communities are talking about.
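As a rough illustration of the cosine similarity measure mentioned in the proposal, here is a minimal bag-of-words sketch. It is not the code from my project; the tokenizer and the helper names `term_counts` and `cosine_similarity` are simplifications I am assuming for the example, and a real analysis would likely weight terms (for example with TF-IDF) and drop stop words.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase bag-of-words counts, using a deliberately crude tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Two made-up post snippets
print(cosine_similarity("mixed models in R", "bayesian models in SAS"))
```

Pairwise scores like this could feed a clustering step, or serve as edge weights to compare against the link-based communities.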

Saturday, October 20, 2012

Nate Silver on The Daily Show

Watch it!

There’s an interesting conversation about how the campaigns use analytics in get-out-the-vote efforts. It doesn’t go into much depth, but I think this is an important aspect of campaigns that will come out into public view in the next couple of election cycles.

Of course, you can find his blog at

Wednesday, September 19, 2012

Data cleaning is harder than statistical analysis

Statistical analysis is relatively hard, but it is a piece of cake compared to data collection, cleaning, and manipulation. In fact, in clinical trials research, we spend millions of dollars to develop and advance the capability to effectively manage data. Just about any clinical research organization worth the price has a strong data management department that they’ve spent a lot of time cultivating.

It’s time to take this a step further. In my workplace, we have a very close integration of the statistics group (consisting of statisticians and statistical programmers) and the data management group. In the latest issue of their newsletter, the Society for Clinical Data Management has included an article on the optimal collaboration between statisticians and data managers. (I take this a step further and include the medical writer.) This collaboration takes a lot of time – time I could be spending doing statistical analysis. However, if the statistical analysis involves working around fewer data issues, it’s all worth it.

Monday, September 10, 2012

Exercise helps statisticians

Statistics is a rather sedentary job, and, over the years, I found my effectiveness decreasing as I found fewer “peak” hours in the day. I also gained a lot of weight. The number of migraines I experienced went from about two a year to about once a month.

In the last two or three years, I’ve been getting out of my chair to go for runs, I’ve taken up taekwondo, and I’ve also joined a small gym that provides small-group personal training. In addition to adding who knows how many years to my life, they’ve really helped my focus and concentration when I’m doing statistics. I’ve also decided to take time off from thinking about statistics once a week or so, which I’m finding helpful.

I only wish I had established these habits years ago.

Wednesday, August 29, 2012

Integrating R into a SAS shop

I work in an environment dominated by SAS, and I am looking to integrate R into our environment.

Why would I want to do such a thing? First, I do not want to get rid of SAS. That would not only take away most of our investment in SAS training and hiring good quality SAS programmers, but it would also remove the advantages of SAS from our environment. These advantages include the following:

  • Many years of collective experience in pharmaceutical data management, analysis, and reporting
  • Workflow that is second to none (with the exception of reproducible research, where R excels)
  • Reporting tools based on ODS that are second to none
  • SAS has much better validation tools than R, unless you get a commercial version of R (which makes IT folks happy)
  • SAS automatically does parallel processing for several common functions

So, if SAS is so great, why do I want R?

  • SAS’s pricing model makes it so that if I get a package that does everything I want, I pay thousands of dollars per year more than the basic package and end up with a system that does way more than I need. For example, if I want to do a CART analysis, I have to buy Enterprise Miner, which does way more than I would need.
  • R is more agile and flexible than SAS
  • R more easily integrates with Fortran and C++ than SAS (I’ve tried the SAS integration with DLLs, and it’s doable, but hard)
  • R is better at custom algorithms than SAS, unless you delve into the world of IML (which is sometimes a good solution).

I’m still looking at ways to do it, although the integration with IML/IML studio is promising.

Monday, August 27, 2012

Romney’s “secretive data mining”–could the same techniques be used for clinical trial enrollment?

Romney has been “exposed” as using “secretive data mining techniques” to find donors to his campaign in traditional Democratic strongholds. (These techniques, which can be learned in any of the free online courses offered through Coursera and Udacity, are applied to the massive databases collected by the different parties.)

Of course, my thought is, can we use these techniques to find potential participants in clinical trials? I think that if we can work out the privacy issues, this represents a useful tool for clinicians to find not just trial participants, but patients who need to be treated, but for some reason are not being treated. This could be a win for everybody.

Other ideas:

  • using Google trends, much like Google uses to identify flu outbreaks
  • mining discussion boards
  • identifying need through blog networks

I’ll be taking the Web Intelligence and Big Data class through Coursera, so maybe I’ll get more ideas.

Monday, August 20, 2012

Clinical trials: enrollment targets vs. valid hypothesis testing

The questions raised in this Scientific American article ought to concern all of us, and I want to take some of these questions further. But let me first explain the problem.

Clinical trials and observational studies of drugs, biologics, and medical devices are a huge logistical challenge, not the least of which is finding physicians and patients to participate. The thesis of the article is that the classical methods of finding participants – mostly compensation – lead to perverse incentives to lie about one’s medical condition.

I think there is a more subtle issue, and it struck me when one of our clinical people expressed a desire not to put enrollment caps on large hospitals for the sake of a fast enrollment. In our race to finish the trial and collect data, we are biasing our studies toward larger centers where there may be better care. This effect is exactly the opposite of that posited in the article, where treatment effect is biased downward. Here, treatment effect is biased upward, with doctors more familiar with best delivery practices (many of the drugs I study are IV or hospital-based), best treatment practices, and more efficient care.

We statisticians can start to characterize the problem by looking at treatment effect by different sites, or using hierarchical models to separate out center effect from drug. But this isn’t always a great solution, because low-enrolling sites, by definition, have a lot fewer people, and pooling is problematic because low-enrolling centers tend to have way more variation in level and quality of care than high-enrolling centers.

We can get creative on the statistical analysis end of studies, but I think the best solution is going to involve stepping back at the clinical trial logistics planning stage and recasting the recruitment problem in terms of a generalizability/speed tradeoff.
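As a first look at the by-site comparison described above, the sketch below computes a treatment effect and its standard error for each site. It is not a hierarchical model, only the simple per-site summary one might start with, and the site names, counts, and the helper `site_effect` are all hypothetical.

```python
import math


def site_effect(treated_responders, treated_n, control_responders, control_n):
    """Difference in response rates (treated minus control) and its standard error."""
    p_t = treated_responders / treated_n
    p_c = control_responders / control_n
    diff = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / treated_n + p_c * (1 - p_c) / control_n)
    return diff, se


# Hypothetical counts: (treated responders, treated n, control responders, control n)
sites = {
    "large center": (120, 200, 90, 200),
    "small center": (6, 10, 4, 10),
}
for name, counts in sites.items():
    diff, se = site_effect(*counts)
    print(f"{name}: effect = {diff:.2f}, standard error = {se:.2f}")
```

With counts like these, the small center's estimate is far noisier than the large center's, which is the practical reason by-site comparisons and pooling get difficult when enrollment is lopsided.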

Wednesday, August 15, 2012

Statisticians need marketing

I can't think of too much I would disagree with in Simply Statistics's posting on Statistics/statisticians need better marketing. I'll elaborate on a couple of points:

We should be more “big tent” about statistics. ASA President Robert Rodriguez nailed this in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era.
Apparently, the idea of data mining was rejected by most statisticians about 30 years ago, and it has found a home in computer science. Now data science is growing out of computer science, and analytics seems to be growing out of some hybrid of computer science and business. The embracing of the terms "data science" and "analytics" puzzled me for a long time, because these fields seemed to be just statistics, data curation, and understanding of the application. (I recognize now that there is some more to it, especially the strong computer science component in big data applications.) I now see the tension between statisticians and practitioners of these related fields, and the puzzlement remains. Statisticians have a lot to contribute to these blooming fields, and we damn well better get to it.

We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be. 
Stanford has this down to a business plan. So do some other universities. This trail is being blazed, and we can just hop on it.

It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data.
Statistics Without Borders is a great place to do this. Perhaps SWB needs better marketing as well?

Monday, August 13, 2012

Observational data is valuable

I’ve heard way too many times that observational studies are flawed, and to really confirm a hypothesis you have to do randomized controlled trials. Indeed, this was an argument in the hormone replacement therapy (HRT) controversy (scroll down for the article). Now that I’ve worked with both observational and randomized data, here are a few observations:

  • The choice of observational vs. randomized is an important, but not the only, study design choice.

    Studies have lots of different design choices: followup length, measurement schedule, when during disease course to observe, assumptions about risk groups, assumptions about stability of risk over time (which was important in the HRT discussion about breast cancer), and the list goes on. A well-designed observational trial can give a lot more valid information than a poorly-designed randomized trial.
  • Only one aspect of a randomized trial is randomized (usually). Covariates and subgroups are not randomized.
  • Methods exist to make valid comparisons in an observational study. Data have to be handled much more carefully, and the assumptions behind the statistical methods have to be examined more carefully. However, very powerful methods such as causal analysis or case-control studies can be used to make strong conclusions.

Observational studies can complement or replace randomized designs. In fact, in controversies such as the use of thimerosal in vaccines, observational studies have been required to supply all the evidence (randomizing children to thimerosal and non-thimerosal groups in a randomized study to see if they develop autism is not ethical). In post-marketing research and development for drugs, observational studies are used to further establish safety, determine the rate of rare serious adverse events, and determine the effects of real-world usage on the efficacy that has been established through randomized trials.

Through careful planning, observational studies can generate new results, extend the results of randomized trials, or even set up new randomized trials.

Wednesday, August 8, 2012

The problem of multiple comparisons

John Cook's post "Wrong and unnecessary" at The Endeavour and the comments to that post are all worth reading (I rarely say that about comments to a post). The post is ostensibly on whether a linear model is useful even though it is not, in the perfect sense of the word, correct. In the comments, the idea of multiple comparisons is brought up, and not just whether they are appropriate but to what extent comparisons must be adjusted.

(For those not familiar with the problem, the Wikipedia article is worth reading. Essentially, multiple comparisons is a problem in statistics where you have a greater chance of declaring statistical significance if you compare multiple endpoints naively, or compare the same one over and over as you collect more data. Statisticians have many methods for adjusting for this effect depending on the situation.)

In pivotal trials, usually only the primary endpoints have multiple comparisons applied, and sometimes the multiple comparisons are applied separately to secondary endpoints. (I find this practice bizarre, though I have heard it endorsed at least once by a reviewing team at the FDA.) Biostatisticians have complained that testing covariate imbalances at baseline (e.g., did people entering the treatment and placebo groups have the same distribution of ages?) adds to the multiple comparisons problem, even though these baseline tests do not directly lead to some declaration of significance. Bayesian methods do not necessarily "solve" the multiple comparisons problem, but rather account appropriately for multiple testing when looking at data multiple times or looking at multiple endpoints, if the methodology is set up appropriately.

Multiple comparison methods tend to break down in situations of a large number of experiments or dimensions, such as "A/B" experiments that web sites tend to run or testing for the existence of adverse events of a drug, where sponsors tend to want to see a large number of Fisher's Exact tests. (In fact, the second situation suffers from the more fundamental problem that committing a Type I error - declaring the existence of an adverse event where there is none - is more conservative than committing a Type II error. Multiple comparison adjustments can lead to erroneous assurances of safety, while not adjusting can lead to lots of additional research confirming or denying the existence of significant Fisher's tests.)

I have even heard the bizarre assertion that multiple comparison adjustments are required when discussing several tests on one biological process, but then you get another multiple comparison adjustment when you do tests on another process even in the same study.

I assert that while we have a lot of clever methods for multiple comparisons, we have bizarre and arbitrary rules for when to apply them. Multiple comparison methodologies (simple, complex, or modern) control Type I error rate or similar measures over the tests to which they are applied, so I think that the coverage of these tests needs to be justified scientifically beyond such arbitrary rules.
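For readers who want to see the size of the problem, here is a minimal sketch of how the chance of at least one false positive grows with the number of tests, along with the simplest adjustment (Bonferroni). It assumes independent tests with every null hypothesis true, which is an idealization; the function names are mine.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / m


for m in (1, 5, 20):
    print(m, round(familywise_error_rate(0.05, m), 3), bonferroni_alpha(0.05, m))
```

Note that the answer depends entirely on which tests are counted in m, which is exactly the scoping decision argued above to need scientific justification.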

Monday, August 6, 2012

Getting connected: why you should get connected to people, and how

Getting connected to professionals in your field can be difficult, but it’s worth the effort. Here’s why, in no particular order:

  1. you exchange ideas, find new ways to approach problems, share career experiences, and learn to navigate the multitude of aspects of your profession
  2. you form connections that can potentially help if you need to change jobs
  3. it’s fun to be social (even if you are an introvert like me)
  4. you can potentially add a lot of value to your company, leading to career advancement opportunities
  5. you can justify going to cool conferences, if you enjoy those
  6. you have a better chance of new opportunities to publish, get invited talks, or collaborate

The above is fairly general, but for the how I will focus on statisticians because that is where I can offer the most:

  1. Join a professional organization. For statisticians in the US, join the American Statistical Association (ASA). Other countries have similar organizations; for instance, the UK has the Royal Statistical Society, and Canada, India, and China have similar societies. In addition, there are more specialized groups such as the Institute of Mathematical Statistics, Society for Industrial and Applied Mathematics, Eastern North American Region of the International Biometric Society, Western North American Region of the International Biometric Society, and so forth. The ASA is very broad, and these other groups are more specialized. Chances are, there is a specialized group in your area.
  2. If you join the ASA, join a section, and find out if there is an active local chapter as well. The ASA is so huge that it is overwhelming to new members, but sections are smaller and more focused, and local chapters offer the opportunity to connect personally without a lot of travel or distance communication.
  3. You might start a group in your home town, such as an R User’s Group. Revolution Analytics will often sponsor a fledgling R User’s Group. Of course, this startup doesn’t have to be focused on R.
  4. If you have been a member for a couple of years, offer to volunteer. Chances are, the work is not glorious, but it will be important. The most important part, anyway, is that you will gain skills coordinating others and meet new people.
  5. If you go to a conference, offer to chair and try to speak. It is very easy to speak at the Joint Statistical Meetings.
  6. Use social media to get online connections, then try to meet these people in real life. I have formed several connections because I blog and tweet (@randomjohn). You can also use Google+, though I haven’t quite figured out how to do so effectively. I also don’t use Facebook that much for my professional outlet, but it is possible. Blogging offers a lot of other benefits as well, if you do it correctly. Blogging communities, such as R-Bloggers and SAS Community, enhance the value of blogging.

Getting connected is valuable, and it takes a lot of work. Think of it as a career-long effort, and your network as a garden. It takes time to set up, maintain, and cultivate, but the effort is worth it.

Thursday, August 2, 2012

JSM 2012 in the rearview: reflections on the world's largest gathering of statisticians

The Joint Statistical Meetings is an annual gathering of several large professional organizations of statisticians, and every year we descend on some city to share ideas. I'm a perennial attendee, and always find the conference valuable in several ways. I have a few thoughts about the conference in retrospect:

* For me, networking is much more important than talks. Of course, attending talks in your area of interest is a great way of finding people with whom to network.
* I'm very happy I volunteered with the biopharmaceutical section a couple of years ago. It's a lot of work, but rewarding.
* This year, I specifically went to a few sections out of my area, and found the experience valuable.
* I definitely recommend chairing a session or speaking.
* I also recommend roundtable lunches. I did one for the first time this year, and found the back and forth discussion valuable.

In short, I find connecting with like-minded professionals to be an important part of my career and development as a person.

Wednesday, May 30, 2012

Statistical leadership, part IV: the world needs you

Read this, even if you are not a statistician. Go on, I'll be here when you get back.

This article was adapted from Roger Hoerl's excellent Deming Lecture at the Joint Statistical Meetings in 2011. This is a call to action, of course, but a call to something even deeper. In a time when we are running very short on critical thinking, we need more people to think critically and speak up. Critical thought, of course, implies more than just speaking against something (the status quo, proposed solutions, or other object of thought), but rather seeking a deeper understanding of the problems that face us, and what the most effective solutions are. In this imperfect world of tradeoffs, we have to understand the impact of solutions and of solving problems.

This is a call to understand our world, and to make it better. (The former does not necessarily precede the latter.)

Tuesday, May 15, 2012

Thoughts on privacy

As this world gets more connected, and as data storage and analysis advances, we have to change our notions of privacy and data stewardship. About 25 years ago, right before email hit the big time and when data analysis methods were limited to small datasets or Cray supercomputers, having data was a huge deal. Coverups, such as Watergate, were characterized by hiding data from others. While that is still true, hiding data is a lot harder now, and, with increases in computing speed and availability of data, it’s a lot harder to hide from the rest of the world.

Whether we like it or not, our notions of privacy have to change. In a recent instance, Target knew of a daughter’s pregnancy before her father did. (Mailings to the house were the source of a lot of consternation and an uncomfortable chat.) Doing this is fairly easy: you assign an ID number to each customer based on credit cards or loyalty cards, mine purchase data for what can predict not just pregnancy but also a due date, and apply it to future customers. Many first year statistics grad students have already learned the basic methods for doing this. This cat is out of the bag, and it’s not going back in. We will not be able to legislate this practice out of existence (and perhaps we shouldn’t be, anyway).

So what now? How does privacy have to change? It appears that a new attitude toward privacy is rising, but this is equally disturbing. In the link, teens were given Blackberries, with the understanding that everything they did on it would be monitored and analyzed, and they still went for it. They even did drug deals using these devices!

I think our privacy laws have to evolve to deal with this new reality. We require de-identified data for data released to the public, but even that strategy will only be useful for so long. No, the bounds of acceptable behavior based on data have to be re-thought. For example, is it ok to drop insurance coverage based on FB postings of drunken parties or certain tweets? Is it ok to terminate an employee because a manager did some social network analysis of public data and found some badmouthing of the company? Is it ok for a car insurance company to bump up your premium because you blogged about Top Gear? Answering these kinds of questions, which really are just a couple of steps away from product recommendations, with legislation will just be the start.

Monday, April 30, 2012

Statistical leadership part III–shameless plug for PharmaSUG talk

PharmaSUG is a yearly gathering of SAS programmers who program for the pharmaceutical industry. This year, Dr. Katherine Troyer of REGISTRAT-MAPI will be giving a talk entitled “Giving Data a Voice: Partnering with Medical Writing for Best Reporting Practices,” in which she will implore the audience to get statisticians, medical writers, SAS programmers, clinicians, data managers, and any other stakeholder together early and often in the clinical trial process. While it may seem like the medical writer may only need to come into the process late, they actually have to put everything together. In the spirit of beginning with the end in mind, planning should include all of us.

If you’re going to PharmaSUG this year, please attend this talk!

Monday, April 23, 2012

Coursera (and other online classes)

A revolution is taking place in education. Last fall, Stanford University premiered three online classes in Artificial Intelligence, Machine Learning, and Introduction to Databases. I took Machine Learning and Intro to Databases, and this spring I’m taking Probabilistic Graphical Models, Natural Language Processing, and Model Thinking.

This winter and spring, that effort has evolved into Coursera, and the course offering has expanded to about 30 courses across disciplines and difficulties. Other universities, such as the University of Michigan, UPenn, and Princeton have gotten in on the action. Other professors have their own effort called Udacity (which concentrates on computer science and artificial intelligence after the primary interest of Sebastian Thrun of the Google robotic car), and MIT has developed their own platform.

So far, all the classes I have been through have been high quality. There are a few glitches as Coursera is blazing trails here, but overall I’m happy to take a small part in this revolution.

Wednesday, March 21, 2012

Using R for a salary negotiation–an extension of decision tree models

Let’s say you are in the middle of a salary negotiation, and you want to know whether you should be aggressive in your offering or conservative. One way to help with the decision is to make a decision tree. We’ll work with the following assumptions:

  • You are at a job currently making $50k
  • You have the choice between asking for $60k (which will be accepted with probability 0.8) or $70k (which will be accepted with probability 0.2).
  • You get one shot. If your asking price is rejected, you stay at your current job and continue to make $50k. (This is one of those simplifying assumptions that we might dispense with later.)

This simplification of reality can be represented with a decision tree:


I went ahead and put in the expected payoff for each of these decisions. Because the more conservative approach has a higher expected payoff, this model suggests that you should take the conservative approach.
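
As a quick check of the arithmetic behind the tree, here is a minimal sketch in R. The variable names are mine, chosen for illustration; the numbers are the assumptions listed above.

```r
# Expected payoff of each asking price under the two-choice model.
# All salaries are in $k; variable names are illustrative only.
current <- 50                                      # salary if the ask is rejected
ask     <- c(conservative = 60, aggressive = 70)   # the two asking prices
p_yes   <- c(conservative = 0.8, aggressive = 0.2) # P(ask is accepted)

# expected payoff = ask * P(accepted) + current * P(rejected)
expected <- ask * p_yes + current * (1 - p_yes)
print(expected)                  # conservative: 58, aggressive: 54
print(names(which.max(expected)))
```

The conservative ask wins here only because its acceptance probability is so much higher; changing either probability changes the answer.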

One shortcoming clearly is that this decision tree only shows two decisions, but really you have a range of decisions; you are not stuck with $60k or $70k for asking price. You might go with $65k, or $62.5k, or something else. So what would be the optimal asking price?

Again, we look at expected payoff, which is asking price * probability(offer accepted) + $50k * probability(offer rejected). In this case, we need to model the probability that the offer is accepted over the range of possible offers, not just at the two points. The logistic model works very well for modeling probability, and a logistic model with two parameters can be fit exactly to two points, so that is what I will use here to extend the two-point model.

Here is my commented R code to implement this model:

my.offer <- function(x1=60,py.x1=.2,x2=70,py.x2=.8,low=50,high=100,p.payoff=1) {
  # return the offer to maximize expected payoff
  # this assumes a game with one decision and one consequence
  # you give an offer, and it is taken or refused. If taken, you receive a salary of
  # (a function of) the offer. If refused, you stay at the old job and receive a
  # salary of low (presumably a current salary, but set to 0 if you are
  # unemployed).
  # the probability of rejection is modeled with a logistic function defined by
  # two points (x1,py.x1) and (x2,py.x2)
  # for example, if you expected a 20% rej. prob. with an offer of 140k, then
  # x1,py.x1 = 140,.2. Similarly with x2,py.x2
  # the expected payoff is modeled as offer*P(Yes|offer) + low*P(No|offer),
  # perhaps with modifications to account for benefits, negotiation, etc. This
  # is defined in payoff function below.
  # finally, high is defined as anything above what you would be expecting to offer
  # and is used to create the plot limits and set the bounds in the optimization
  # routine.
  # note: the name and default of low are reconstructed; low=50 reproduces the
  # result reported below

  # model the probability of no given salary offer
  # here we have a logistic function defined by (x1,py.x1) and (x2,py.x2)
  # note that qlogis is the logit function (plogis is its inverse)
  # also, matrices in R are defined in column-major form (as in FORTRAN), not
  # row-major form like C, so we have to use 1,1,x1,x2 rather than 1,x1,1,x2
  theta <- solve(matrix(c(1,1,x1,x2),ncol=2),matrix(qlogis(c(py.x1,py.x2)),ncol=1))

  # for plot of probability function
  xseq <- seq(low,high,length.out=100)
  yseq1 <- 1/(1+exp(-theta[1]-theta[2]*xseq))

  # model the expected payoff of an offer
  # model negotiations, benefits, and other things here
  # (a simple way to model benefits though is just to change
  payoff <- function(x) {
    tmp <- exp(-theta[1]-theta[2]*x)
    return( (low + ifelse(low>x*p.payoff,low,x*p.payoff)*tmp)/(1+tmp) )
  }
  yseq <- payoff(xseq)

  # plots
  plot(xseq,yseq,type='l',xlab='Offer',ylab='Expected salary')

  # no sense in even discussing the matter if offer < low
  # (the original listing is cut off here; the remaining lines are an assumed
  # completion: plot the rejection probability and maximize the payoff)
  plot(xseq,yseq1,type='l',xlab='Offer',ylab='P(No|offer)')
  optimize(payoff,interval=c(low,high),maximum=TRUE)
}

And here are the graphs and result:


> my.offer()
[1] 61.96761

[1] 58.36087

So this model suggests that the optimum offer is close to $62k, with an expected payoff of around $58k. As a side effect, a couple of graphs are produced: one giving the probability of rejection as a function of the asking price, and one giving the expected salary (payoff) as a function of asking price.

So a few comments are in order:

  • The value in this model is in varying the inputs and seeing how that affects the optimum asking price.
  • The function I provided is slightly more complicated than what I presented in this post. You can model things like negotiation (i.e. you may end up at a little less than your asking price if you are not turned down right away), differences in benefits, and so forth. Once you have a simple and reliable baseline model with which to work, you can easily modify it to account for other factors.
  • Like all models, this is an oversimplification of the salary negotiation process, but a useful oversimplification. There are cases where you want to be more aggressive in your asking, and this model can point those out.
  • I commented the code profusely, but the side effects are probably not the best programming practice. However, this really is a toy model, so feel free to rip off the code.
  • This model of course extends to other areas where you have a continuous range of choices with payoffs and/or penalties.
  • If you are able to gather data on the probability of rejection based on offer, so much the better. You can then, instead of fitting an exact probability model, perform a logistic regression and use that as the basis of the expected payoff calculation.
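
To illustrate that last point, here is a rough sketch of the logistic regression variant. The data are simulated purely for illustration (nobody handed me 200 real negotiations), and every name in it (ask, rejected, fit, payoff, best) is my own choice rather than part of the function above. The fallback salary of 50 matches the assumptions of this post.

```r
# Sketch only: fit P(rejection | asking price) by logistic regression on
# HYPOTHETICAL simulated data, then maximize the expected payoff.
set.seed(1)
ask      <- runif(200, 50, 80)                      # made-up asking prices, $k
rejected <- rbinom(200, 1, plogis(-18 + 0.277 * ask)) # made-up outcomes (1 = rejected)

fit <- glm(rejected ~ ask, family = binomial)

low <- 50  # salary kept if the ask is rejected
payoff <- function(x) {
  # estimated probability of rejection at asking price x
  p_no <- as.numeric(predict(fit, newdata = data.frame(ask = x),
                             type = "response"))
  low * p_no + x * (1 - p_no)
}

best <- optimize(payoff, interval = c(low, 100), maximum = TRUE)
print(best$maximum)    # asking price with the highest expected payoff
print(best$objective)  # expected payoff at that asking price
```

With real data, the fitted curve carries estimation error, so it would be worth checking how much the optimum moves across plausible fits before trusting it.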

Monday, March 5, 2012

Why I hate p-values (statistical leadership, Part II)

One statistical tool is the ubiquitous p-value. If it’s less than 0.05, your hypothesis must be true, right? Think again.

Ok, so I don’t hate p-values, but I do hate the way that we abuse them. And here’s where we need statistical leadership to go back and critique these p-values before we get too excited.

P-values can make or break venture capital deals, product approval for drugs, or senior management approval for a new design of deck lid. In that way, we place a little too much trust in them. Here’s where we abuse them:

  • The magical 0.05: if we get a 0.051, we lose, and if we get a 0.049, we win! Never mind that the same experiment run under the same conditions can easily produce both of these results. (The difference between statistically significant and not significant is not significant.)
  • The misinterpretation: the p-value is not the probability of the null hypothesis being true, but rather the long-run relative frequency with which similar experiments run under the same conditions will produce a test statistic at least as extreme as the one you had in your experiment, if the null hypothesis is true. Got that? Well, no matter how small your p-value is, I can get a wimpy version of your treatment and get a smaller p-value, just by increasing the sample size to what I need. P-values depend on effect size, effect variance, and sample size.
  • The gaming of the p-value: in clinical trials it’s possible to make your p-value smaller by restricting your subject entry criteria to what brings out the treatment effect the most. This is not usually a problem, except to keep in mind that the rarefied world of early phase clinical studies is different from the real world.
  • The unethical gaming of the p-value: this comes from retrospectively tweaking your subject population. I guess it’s ok if you don’t try to pass this off as real results, but rather as information for further study design, but you can’t expect any scientific validity from tweaking a study, its population, or its analysis after the results are in.
  • Covariate madness: covariates tend to decrease the p-value by partitioning the variation in drug effect. That’s great if you want to identify segments of your patient population. But if you do covariate selection and then report your p-value from the final model, you have a biased p-value.
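
The sample size point in the list above is easy to see with a little arithmetic. This is a minimal sketch under simplifying assumptions of my own: a two-sample z-test with known standard deviation 1, equal arms of n subjects, and an observed difference exactly equal to the standardized effect d. The function name p_for_n is made up for this example.

```r
# Two-sided p-value for a two-sample z-test when the observed standardized
# difference is exactly d and each arm has n subjects (sd assumed known, = 1).
p_for_n <- function(d, n) 2 * pnorm(-abs(d) * sqrt(n / 2))

n <- c(20, 50, 100, 500, 2000)
print(data.frame(n      = n,
                 strong = p_for_n(0.5, n),   # a decent treatment effect
                 wimpy  = p_for_n(0.1, n)))  # an effect one fifth the size
```

Under these assumptions, the wimpy effect with 2000 subjects per arm gets a far smaller p-value than the strong effect with 20 per arm, which is why a p-value on its own says little about how large an effect is.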

Statisticians need to stay on top of these issues and advocate for the proper interpretation of p-values. Don’t leave it up to someone with an incomplete understanding of these tools.

Saturday, January 14, 2012

Faster reading through math

Let’s face it, there is a lot of content on the web, and one thing I hate is reading halfway through an article and realizing that the title and first paragraph indicate little about the rest of the article. In effect, I check out the quick content first (usually after a link), and am disappointed.

My strategy now is to use automatic summaries, which are now a lot more accessible than they used to be. The algorithm, due to H. P. Luhn, has been around since 1958 (!) and is described in books such as Mining the Social Web by Matthew Russell (where a Python implementation is given). With a little work, you can create a program that scrapes text from a blog, provides short and long summaries, and links to the original post, and packages it up in a neat HTML page.
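
To give a flavor of the idea, here is a simplified Luhn-style summarizer in R. It is a sketch, not Luhn's exact scoring and not the Python implementation from the book: it treats the most frequent non-stopwords as "significant" and scores each sentence by the squared count of significant words divided by the span they cover. The function name luhn_summary, its parameters, and the short stopword list are all assumptions of mine.

```r
# Simplified Luhn-style extractive summary (illustrative sketch only).
luhn_summary <- function(text, n_sentences = 2, n_top_words = 10,
                         stopwords = c("the", "a", "an", "and", "of", "to",
                                       "in", "is", "it", "that", "this", "for",
                                       "on", "with", "as", "are", "was", "be",
                                       "by", "or")) {
  # split into sentences at ., ! or ? followed by whitespace
  marked    <- gsub("([.!?])\\s+", "\\1\n", text)
  sentences <- unlist(strsplit(marked, "\n", fixed = TRUE))

  # lowercase word tokens for each sentence
  tokens <- lapply(sentences, function(s) {
    w <- unlist(strsplit(tolower(s), "[^a-z0-9']+"))
    w[nzchar(w)]
  })

  # the most frequent non-stopwords are the "significant" words
  words <- unlist(tokens)
  words <- words[!(words %in% stopwords)]
  freq  <- sort(table(words), decreasing = TRUE)
  significant <- names(freq)[seq_len(min(n_top_words, length(freq)))]

  # score = (number of significant words)^2 / span from first to last of them
  score <- vapply(tokens, function(w) {
    idx <- which(w %in% significant)
    if (length(idx) == 0) return(0)
    length(idx)^2 / (max(idx) - min(idx) + 1)
  }, numeric(1))

  # keep the top-scoring sentences, in their original order
  top <- order(score, decreasing = TRUE)[seq_len(min(n_sentences,
                                                     length(sentences)))]
  sentences[sort(top)]
}

example_text <- paste(
  "Statistics is the science of learning from data.",
  "I had coffee this morning.",
  "Good statistics turns data into decisions, and data into knowledge.")
print(luhn_summary(example_text, n_sentences = 1, n_top_words = 3))
```

Scraping the text from a blog and wrapping the result in HTML are separate steps that this sketch leaves out.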

Or you can use the cute interface in Safari, if you care to switch.

Wednesday, January 4, 2012

Competing in data mining competitions

I’m competing in several data mining competitions over at Kaggle. So far, I haven’t really done well, but I am learning a lot. Here’s what I’m getting out of it:

  • Variety in applying statistical techniques to real-world problems
  • Clarifying for myself what the bias-variance tradeoff really means
  • Trying new techniques, such as those I got out of the free online machine learning class
  • Humility

If you’re into statistics, you should try it! Kaggle isn’t the only competition forum in town, but it’s a good one. (Tunedit has one competition in classification of biomedical papers, and KDNuggets regularly announces contests from sites.