Monday, December 5, 2011

From datasets to algorithms in R

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:

  • You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
  • You can specify interactions and nesting.
  • You can automatically create a numerical matrix for a formula using model.matrix(formula).
  • Formulas can be updated through the update() function.

Recently, I wanted to create predictions via Bayesian model averaging method (bma library on CRAN), but did not see where the authors implemented it. However, it was very easy to create a function that does this:

predict.bic.glm <- function(bma.fit,new.data,inv.link=plogis) {
    # predict.bic.glm
    #  Purpose: predict new values from a bma fit with values in a new dataframe
    #
    # Arguments:
    #  bma.fit - an object fit by bic.glm using the formula method
    #  new.data - a data frame, which must have variables with the same names as the independent
    #             variables as was specified in the formula of bma.fit
    #             (it does not need the dependent variable, and ignores it if present)
    #  inv.link - a vectorized function representing the inverse of the link function
    #
    # Returns:
    #  a vector of length nrow(new.data) with the conditional probabilities of the independent
    #  variable being 1 or TRUE
    # TODO: make inv.link not be specified, but rather extracted from glm.family of bma.fit$call
   
    form <- formula(bma.fit$call$f)[-2] # extract formula from the call of the fit.bma, drop dep var
    des.matrix <- model.matrix(form,new.data)
    pred <- des.matrix %*% matrix(bma.fit$postmean,nc=1)
    pred <- inv.link(pred)
    return(pred)
}

The first task of the function finds the formula that was used in the call of the bic.glm() call, and the [-2] subscripting removes the dependent variable. Then the model.matrix() function creates a matrix of predictors with the original function (minus dependent variable) and new data. The power here is that if I had interactions or transformations in the original call to bic.glm(), they are replicated automatically on the new data, without my having to parse it by hand. With a new design matrix and a vector of coefficients (in this case, the expectation of the coefficients over the posterior distributions of the models), it is easy to calculate the conditional probabilities.

In short, the formula interface makes it easy to translate from the use case terms (factors, variables, dependent variables, etc.) into linear algebra terms where algorithms are implemented. I’ve only scratched the surface here, but it is worth investing some time into formulas and their manipulation if you intend to implement any algorithms in R.