Monday, December 5, 2011

From datasets to algorithms in R

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:

• You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
• You can specify interactions and nesting.
• You can automatically create a numerical matrix for a formula using model.matrix(formula).
• Formulas can be updated through the update() function.

Recently, I wanted to create predictions via Bayesian model averaging method (bma library on CRAN), but did not see where the authors implemented it. However, it was very easy to create a function that does this:

predict.bic.glm <- function(bma.fit,new.data,inv.link=plogis) {
# predict.bic.glm
#  Purpose: predict new values from a bma fit with values in a new dataframe
#
# Arguments:
#  bma.fit - an object fit by bic.glm using the formula method
#  new.data - a data frame, which must have variables with the same names as the independent
#             variables as was specified in the formula of bma.fit
#             (it does not need the dependent variable, and ignores it if present)
#  inv.link - a vectorized function representing the inverse of the link function
#
# Returns:
#  a vector of length nrow(new.data) with the conditional probabilities of the independent
#  variable being 1 or TRUE
# TODO: make inv.link not be specified, but rather extracted from glm.family of bma.fit\$call

form <- formula(bma.fit\$call\$f)[-2] # extract formula from the call of the fit.bma, drop dep var
des.matrix <- model.matrix(form,new.data)
pred <- des.matrix %*% matrix(bma.fit\$postmean,nc=1)