Realizations in Biostatistics

Wednesday, March 21, 2012

Using R for a salary negotiation–an extension of decision tree models

Let’s say you are in the middle of a salary negotiation, and you want to know whether you should be aggressive in your offering or conservative. One way to help with the decision is to make a decision tree. We’ll work with the following assumptions:

You are at a job currently making $50k
You have the choices between asking $60k (which will be accepted with probability 0.8) or $70k (which will be accepted with probability 0.2).
You get one shot. If your asking price is rejected, you stay at your current job and continue to make $50k. (This is one of those simplifying assumptions that we might dispense with later.)

This simplification of reality can be represented with a decision tree:

I went ahead and put in the expected payoff for each of these decisions. Because the more conservative approach has a higher expected payoff, this model suggests that you should take the conservative approach.

One shortcoming clearly is that this decision tree only shows two decisions, but really you have a range of decisions; you are not stuck with $60k or $70k for asking price. You might go with $65k, or $62.5k, or something else. So what would be the optimal asking price?

Again, we look at expected payoff, which is asking price*probability(offer accepted) + $50k * probability(offer rejected). In this case, we need to model the probability that offer is accepted over the range of possible offers, not just the two points. The logistic model works very well for modeling probability, and that’s what I will use here to extend the two-point model. In fact, a logistic model with two parameters can be fit exactly to two points, and so that is what I will use here.

Here is my commented R code to implement this model:

my.offer <- function(x1=60,py.x1=.2,x2=70,py.x2=.8,ev.no=50,high=100,p.payoff=1) {
  # return the offer to maximize expected payoff
  # this assumes a game with one decision and one consequence
  # you give an offer, and it is taken or refused. If taken, you receive a salary of
  # (a function of) the offer. If refused, you stay at the old job and receive a
  # salary of ev.no (presumably a current salary, but set to 0 if you are
  # unemployed).
  # the probability of rejection is modeled with a logistic function defined by
  # two points (x1,py.x1) and (x2,py.x2)
  # for example, if you expected a 20% rej. prob. with an offer of 140k, then
  # x1,py.x1 = 140,.2. Similarly with x2,py.x2
  # the expected payoff is modeled as offer*P(Yes|offer) + ev.no*P(No|offer),
  # perhaps with  modifications to account for benefits, negotiation, etc. This
  # is defined in payoff function below.
  # finally, high is defined as anything above what you would be expecting to offer
  # and is used to create the plot limits and set the bounds in the optimization
  # routine.   # model the probability of no given salary offer
  # here we have a logistic function defined by (x1,py.x1) and (x2,py.x2)
  # note that qlogis is the inverse logit function
  # also, matrices in R are defined in column-major form, not row-major form like
  # FORTRAN, so we have to use 1,1,x1,x2 rather than 1,x1,1,x2
  theta <- solve(matrix(c(1,1,x1,x2),nc=2),matrix(qlogis(c(py.x1,py.x2)),nc=1))   # for plot of probability function
  xseq <- seq(ev.no,high,length=100)
  yseq1 <- 1/(1+exp(-theta[1]-theta[2]*xseq))   # model the expected payoff of an offer
  # model negotiations, benefits, and other things here
  # (a simple way to model benefits though is just to change ev.no)
  payoff <- function(x) {
    tmp <- exp(-theta[1]-theta[2]*x)
    return( (ev.no + ifelse(ev.no>x*p.payoff,ev.no,x*p.payoff)*tmp)/(1+tmp) )
  }     yseq <- payoff(xseq)   # plots
  par(mfrow=c(1,2))
  plot(xseq,yseq1,type='l',xlab='Offer',ylab='P(No|X)')
  plot(xseq,yseq,type='l',xlab='Offer',ylab='Expected salary')   # no sense in even discussing the matter if offer < ev.no
  return(optimize(payoff,interval=c(ev.no,high),maximum=TRUE))
}

Created by Pretty R at inside-R.org

And here are the graphs and result:

> my.offer()
$maximum
[1] 61.96761

$objective
[1] 58.36087

So this model suggests that the optimum offer is close to $62k, with an expected payoff of around $58k. As a side effect, a couple of graphs are produced: giving the probability of rejection as a function of the asking price, and the expected salary (payoff) as a function of asking price.

So a few comments are in order:

The value in this model is in varying the inputs and seeing how that affects the optimum asking price.
The function I provided is slightly more complicated than what I presented in this post. You can model things like negotiation (i.e. you may end up at a little less than your asking price if you are not turned down right away), differences in benefits, and so forth. Once you have a simple and reliable baseline model with which to work, you can easily modify it to account for other factors.
Like all models, this is an oversimplification of the salary negotiation process, but a useful oversimplification. There are cases where you want to be more aggressive in your asking, and this model can point those out.
I commented the code profusely, but the side effects are probably not the best programming practice. However, this really is a toy model, so feel free to rip off the code.
This model of course extends to other areas where you have a continuous range of choices with payoffs and/or penalties.
If you are able to gather data on the probability of rejection based on offer, so much the better. You can then, instead of fitting an exact probability model, perform a logistic regression and use that as the basis of the expected payoff calculation.

Monday, March 5, 2012

Why I hate p-values (statistical leadership, Part II)

One statistical tool is the ubiquitous p-value. If it’s less than 0.05, your hypothesis must be true, right? Think again.

Ok, so I don’t hate p-values, but I do hate the way that we abuse them. And here’s where we need statistical leadership to go back and critique these p-values before we get too excited.

P-values can make or break venture capital deals, product approval for drugs, or senior management approval for a new design of deck lid. In that way, we place a little too much trust in them. Here’s where we abuse them:

The magical 0.05: if we get a 0.51, we lose, and if we get a 0.49, we win! Never mind that the same experiment run under the same conditions can easily produce both of these results. (The difference between statistically significant and not significant is not significant.)
The misinterpretation: the p-value is not the probability of the null hypothesis being true, but rather the long-run relative frequency of times that data from the similar experiments run under the same conditions will produce a test statistic that is at least the value that you had in your experiment, if the null hypothesis is true. Got that? Well, no matter how small your p-value is, I can get a wimpy version of your treatment and get a smaller p-value, just by increasing the sample size to what I need. P-values depend on effect size, effect variance, and sample size.
The gaming of the p-value: in clinical trials it’s possible to make your p-value smaller by restricting your subject entry criteria to what brings out the treatment effect the most. This is not usually a problem, except to keep in mind that the rarified world of early phase clinical studies is different from the real world.
The unethical gaming of the p-value: this comes from retrospectively tweaking your subject population. I guess it’s ok if you don’t try to pass this off as real results, but rather as information for further study design, but you can’t expect any scientific validity to tweaking a study, its population, or its analysis after the results are in.
Covariate madness: covariates tend to decrease the p-value by partitioning the variation in drug effect. That’s great if you want to identify segments of your patient population. But if you do covariate selection and then report your p-value from the final model, you have a biased p-value.

Statisticians need to stay on top of these issues and advocate for the proper interpretation of p-values. Don’t leave it up to someone with an incomplete understanding of these tools.

Saturday, January 14, 2012

Faster reading through math

Let’s face it, there is a lot of content on the web, and one thing I hate worse is reading halfway through an article and realizing that the title and first paragraph indicate little about the rest of the article. In effect, I check out the quick content first (usually after a link), and am disappointed.

My strategy now is to use automatic summaries, which are now a lot more accessible than they used to be. The algorithm has been around since 1958 (!) by H. P. Luhn and is described in books such as Mining the Social Web by Matthew Russell (where a Python implementation is given). With a little work, you can create a program that scrapes text from a blog, provides short and long summaries, and links to the original post, and packages it up in a neat HTML page.

Or you can use the cute interface in Safari, if you care to switch.

Wednesday, January 4, 2012

Competing in data mining competitions

I’m competing in several data mining competitions over at Kaggle. So far, I haven’t really done well, but I am learning a lot. Here’s what I’m getting out of it:

Variety in applying statistical techniques to real-world problems
Clarifying for myself what the bias-variance tradeoff really means
Trying new techniques, such as those I got out of the free online machine learning class
Humility

If you’re into statistics, you should try it! Kaggle isn’t the only competition forum in town, but it’s a good one. (Tunedit has one competition in classification of biomedical papers, and KDNuggets regularly announces contests from sites.

Tuesday, December 27, 2011

Lots of open education resources for your gadgetry

While not related exclusively to statistics, this resource does relate to open education. This page at OpenCulture.com gives a large list of resources of books, courses, and media you can use to fill your new (or old) gadget. They have a list of free online courses as well, with statistics falling under computer science, engineering, and mathematics.

Sounds like Christmas and New Year’s resolution wrapped up into one nice gift!

Thursday, December 22, 2011

A statistician’s view of Stanford’s open Introduction to Databases class

In addition to the Introduction to Machine Learning class (which I have reviewed), I took a class on introduction to databases, taught by Prof. Jennifer Widom. This class consisted of video lectures, review questions, and exercises. Topics covered included XML (markup, DTDs, schema, XPath, XQuery, and XSLT) and relational databases (relational algebra, database normalization, SQL, constraints, triggers, views, authentication, online analytical processing, and recursion). At the end we had a quick lesson on NoSQL systems just to introduce the topic and discuss where they are appropriate.

This class was different in structure from the Machine Learning class in two ways: there were two exams and the potential for a statement of accomplishment.

I think any practicing statistician should learn at least the material on relational databases, because data storage and retrieval is such an important part of statistics today. Many different statistical packages now connect to relational databases through technologies like ODBC, and knowledge of SQL can enhance the use of these systems. For example, for subset analyses it is usually better to do subsetting with the database than pull all the data into the statistical analysis package and then perform the subset. In biostatistics, the data are usually collected in a database using an electronic data capture or paper-based system, which store the data in an Oracle database.

I found that I already use the material in this course, even though I typically don’t write SQL queries more complicated than a simple subset. Some of the examples for the exercises involved representing a social network, which may help when I do my own analyses of networks. Other examples were of relationships that you might find in biostatistics and other fields.

I found that by spending up to 5 hours a week on the class I was able to get a lot out of it. Unfortunately, Stanford is not offering this in the winter quarter, but they have promised to offer this again in the future. I heartily recommend it for anyone practicing statistics, and the price is right.

Tuesday, December 20, 2011

MIT launches online learning initiative

This may not relate directly to statistics, but it relate to my experiences in an online introductory machine learning class.

MIT has decided to launch online public classes of its own. It looks like they are making the platform open-source as well and encouraging other institutions to go online as well.

This may well be the most exciting development in education in years. Hopefully it won’t be too long before we see efforts to get hardware into disadvantaged neighborhoods or other countries (such as the one laptop per child project from a few years ago) combined with this online education.

For Stanford’s winter 2012 classes, check out Class Central.

Monday, December 19, 2011

A statistician’s view on Stanford’s public machine learning course

This past fall, I took Stanford’s class on machine learning. Overall, it was a terrific experience, and I’d like to share a few thoughts on it:

A lot of participants were concerned that it was a watered down version of Stanford’s CS229. And, in fact, the course was more limited in scope and more applied than the official Stanford class. However, I found this to be a strength. Because I was already familiar with most of the methods in the beginning (linear and multiple regression, logistic regression), I could focus more on the machine learning perspective that the class brought to these methods. This helped in later sections where I wasn’t so familiar with the methods.
The embedded review questions and the end of section review questions were very well done, with some randomization algorithm making it impossible to guess until everything was right.
Programming exercises were done in Octave, an open source Matlab-like programming environment. I really enjoyed doing this programming, because it meant I essentially programmed regression and logistic regression algorithms by hand with the exception of a numerical optimization algorithm. I got a huge confidence boost when I managed to get the backpropagation algorithm for neural networks correct. Emphasis on these exercises was on the loops, which you could code using “slow” loops (for loops, for instance), but then really needed to vectorize using the principles of linear algebra. For instance, there was an algorithm for a recommender system that would take hours if coded using for loops, but ran in minutes using a vectorized implementation. (This is because the implicit loops of vectorization were run using optimized linear algebra routines.) In statistics, we don’t always worry about implementation details so much, but in machine learning situations, implementation is important because these algorithms often need to run in real time.
The class encouraged me to look at the Kaggle competitions. I’m not doing terribly well in them, but now at least I’m hacking on some data myself and learning a lot in the process.
The structure of the public class helps a lot over, for example, the iTunes U version of the class. But now I’m looking at the CS 229 lectures on iTunes U and am understanding them a lot more now.
Kudos to Stanford for taking the lead on this effort. This is the next logical progression of distance education, and takes a lot of effort and time.

I also took the databases class, which was even more structured with a mid-term and final exam. This was a bit of a stretch for me, but learning about data storage and retrieval is a good complement to statistics and machine learning. I’ve coded a few complex SQL queries in my life, but this class really took my understanding of both XML-based and relational database systems to the next level.

Stanford is offering machine learning again, along with a gaggle of other classes. I recommend you check them out. (Find a list, for example, at the bottom of the page of Probabilistic Graph Models site.) (Note: Stanford does not offer official credit for these classes.)

Friday, December 16, 2011

It’s not every day a new statistical method is published in Science

I’ll have to check this out – Maximal Information-based Nonparametric Exploration (MINE - har har). The link to the paper in Science.

I haven’t looked at this very much yet. It appears to be a way of weeding out potential variable relationships for further exploration. Because it’s nonparametric, relationships don’t have to be linear, and spurious relationships are controlled with a false discovery rate method. A jar file and R file are both provided.

Monday, December 5, 2011

From datasets to algorithms in R

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:

You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
You can specify interactions and nesting.
You can automatically create a numerical matrix for a formula using model.matrix(formula).
Formulas can be updated through the update() function.

Recently, I wanted to create predictions via Bayesian model averaging method (bma library on CRAN), but did not see where the authors implemented it. However, it was very easy to create a function that does this:

predict.bic.glm <- function(bma.fit,new.data,inv.link=plogis) {
    # predict.bic.glm
    # Purpose: predict new values from a bma fit with values in a new dataframe
    #
    # Arguments:
    # bma.fit - an object fit by bic.glm using the formula method
    # new.data - a data frame, which must have variables with the same names as the independent
    #             variables as was specified in the formula of bma.fit
    #             (it does not need the dependent variable, and ignores it if present)
    # inv.link - a vectorized function representing the inverse of the link function
    #
    # Returns:
    # a vector of length nrow(new.data) with the conditional probabilities of the independent
    # variable being 1 or TRUE
    # TODO: make inv.link not be specified, but rather extracted from glm.family of bma.fit$call

    form <- formula(bma.fit$call$f)[-2] # extract formula from the call of the fit.bma, drop dep var
    des.matrix <- model.matrix(form,new.data)
    pred <- des.matrix %*% matrix(bma.fit$postmean,nc=1)
    pred <- inv.link(pred)
    return(pred)
}

The first task of the function finds the formula that was used in the call of the bic.glm() call, and the [-2] subscripting removes the dependent variable. Then the model.matrix() function creates a matrix of predictors with the original function (minus dependent variable) and new data. The power here is that if I had interactions or transformations in the original call to bic.glm(), they are replicated automatically on the new data, without my having to parse it by hand. With a new design matrix and a vector of coefficients (in this case, the expectation of the coefficients over the posterior distributions of the models), it is easy to calculate the conditional probabilities.

In short, the formula interface makes it easy to translate from the use case terms (factors, variables, dependent variables, etc.) into linear algebra terms where algorithms are implemented. I’ve only scratched the surface here, but it is worth investing some time into formulas and their manipulation if you intend to implement any algorithms in R.