Lectures 8 & 9: Classification | Modeling Social Data

This post covers two lectures on classification, the first a guest lecture from Chris Wiggins.

Chris opened his guest lecture by introducing the problem of classification, where the outcome is categorical (e.g., whether an email is spam or ham) rather than continuous. We first reviewed Bayes’ rule for inverting conditional probabilities via a simple, but somewhat counterintuitive, medical diagnosis example and then adapted this to an (extremely naive) one-word spam classifier. We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurrence independently and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains. Chris concluded with a unifying overview of various loss functions and derived boosting under expotential loss.

Although naive Bayes makes an obviously incorrect assumption that all features are independent, it turns out to be a reasonably useful method in practice. It’s simple and scalable to train, easy to update as new data arrive, easy to interpret, and often more competitive in performance than one might expect. That said, there are some obvious issues with naive Bayes as presented, namely overfitting in the training process and overconfidence / miscalibration when making predictions.

The first issue arises when thinking about how to estimate word probabilities. Simple maximum likelihood estimates (MLE) for word probabilities lead to overfitting, implying, for instance, that it’s impossible to see a word in a given class in the future if we’ve never seen it occur in that class in the past. We dealt with this by thinking about maximum a posteriori (MAP) estimation which led to the idea of Laplace smoothing, or adding pseudocounts to empirical word counts to prevent overfitting. As usual, determining the amount of smoothing to use is an empirical question, often solved by methods such as cross-validation.

As for the second problem of feature independence, we addressed this by abandoning naive Bayes in favor of logistic regression. Logistic regression makes predictions using the same functional form as naive Bayes—the log-odds are modeled as a weighted combination of feature values—but fits these weights in a manner that accounts for correlations between features. We (once again) applied the maximum likelihood principle to arrive at criteria for estimating these weights, and discussed gradient descent and Newton’s methods for solutions. The resulting algorithms are very close in spirit to those for linear regression, but slightly more complex due to the logistic function. And, similar to linear regression, we discussed the idea of regularizing logistic regression by including a term in the loss function to penalize large weight vectors.

We concluded with a discussion of several metrics for evaluating classifiers, including calibration, confusion matrices, accuracy, precision and recall, and the ROC curve.

A few references:

Chapter 12 of Advanced Data Analysis from an Elementary Point of View
Chapter 4 of An Introduction to Statistical Learning
Naive Bayes at 40 by Lewis (1998)
Idiots Bayes—Not So Stupid After All? by Hand and Yu (2001)
A Bayesian Approach to Filtering Junk E-mail from Sahami, Dumais, Heckerman, and Horvitz (1998)
A Plan for Spam by Paul Graham (2002)
An introduction to ROC analysis