Thursday, December 22, 2011

A statistician’s view of Stanford’s open Introduction to Databases class

In addition to the Introduction to Machine Learning class (which I have reviewed), I took a class on introduction to databases, taught by Prof. Jennifer Widom. This class consisted of video lectures, review questions, and exercises. Topics covered included XML (markup, DTDs, schema, XPath, XQuery, and XSLT) and relational databases (relational algebra, database normalization, SQL, constraints, triggers, views, authentication, online analytical processing, and recursion). At the end we had a quick lesson on NoSQL systems just to introduce the topic and discuss where they are appropriate.

This class was different in structure from the Machine Learning class in two ways: there were two exams and the potential for a statement of accomplishment.

I think any practicing statistician should learn at least the material on relational databases, because data storage and retrieval is such an important part of statistics today. Many different statistical packages now connect to relational databases through technologies like ODBC, and knowledge of SQL can enhance the use of these systems. For example, for subset analyses it is usually better to do subsetting with the database than pull all the data into the statistical analysis package and then perform the subset. In biostatistics, the data are usually collected in a database using an electronic data capture or paper-based system, which store the data in an Oracle database.

I found that I already use the material in this course, even though I typically don’t write SQL queries more complicated than a simple subset. Some of the examples for the exercises involved representing a social network, which may help when I do my own analyses of networks. Other examples were of relationships that you might find in biostatistics and other fields.

I found that by spending up to 5 hours a week on the class I was able to get a lot out of it. Unfortunately, Stanford is not offering this in the winter quarter, but they have promised to offer this again in the future. I heartily recommend it for anyone practicing statistics, and the price is right.