Lecture 13: Classification

In this lecture we discussed classification methods for predicting discrete-valued targets (e.g., spam classification or gender identification). We noted several potential issues in directly applying linear regression to classification problems and explored naive Bayes and logistic regression as alternatives.

We first reviewed Bayes’ rule for inverting conditional probabilities via a simple, but perhaps counterintuitive, medical diagnosis example and then adapted this to an (extremely naive) one-feature classifier. We improved upon this by considering naive Bayes—a simple linear method for classification in which we model each feature independently given the class. While the independence assumption is almost certainly incorrect, naive Bayes turns out to work well in practice, and it is simple and cheap to train and to make predictions with at scale. However, it fails to account for correlations among features.
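To make the naive Bayes recipe concrete, here is a minimal sketch of a Bernoulli naive Bayes classifier in NumPy, assuming binary (e.g., word-presence) features and add-one (Laplace) smoothing; the class name `BernoulliNaiveBayes`, the toy data, and the smoothing choice are illustrative, not specifics from the lecture.

```python
import numpy as np

class BernoulliNaiveBayes:
    """Minimal Bernoulli naive Bayes: each binary feature is modeled
    independently given the class, with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        # X: (n_samples, n_features) binary matrix; y: (n_samples,) class labels
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        # theta_[c, j] = smoothed estimate of P(x_j = 1 | y = c)
        self.theta_ = np.array([
            (X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
            for c in self.classes_
        ])
        return self

    def predict(self, X):
        # log P(y=c) + sum_j [ x_j log theta_cj + (1 - x_j) log(1 - theta_cj) ]
        log_joint = (
            self.log_prior_
            + X @ np.log(self.theta_).T
            + (1 - X) @ np.log(1 - self.theta_).T
        )
        return self.classes_[np.argmax(log_joint, axis=1)]

# Tiny "spam"-style example with three binary word features
X = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
model = BernoulliNaiveBayes().fit(X, y)
print(model.predict(np.array([[1, 0, 0]])))  # -> [1]
```

Note that training reduces to counting per-class feature frequencies, which is why naive Bayes scales so easily; the independence assumption is visible in the sum over per-feature log probabilities.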

Logistic regression addresses this issue by modeling the posterior class probabilities directly, using a logistic function to transform the output of a linear model to lie in the unit interval:
$$ p(y=1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}}. $$
While maximum likelihood estimation for logistic regression does not permit a closed-form solution, gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood) yields the following update, similar in form to linear regression:
$$ \hat{w} \leftarrow \hat{w} + \eta\, X^T (y - p). $$
In smaller-scale settings one can improve on these updates by using second-order methods such as Newton-Raphson, which leverage the local curvature of the likelihood to determine the step at each iteration. As with regression, some form of regularization is often useful for balancing the fit to the training data against generalization error when one has a relatively large number of features.
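As a sketch of the gradient update above, the following NumPy snippet implements the rule $\hat{w} \leftarrow \hat{w} + \eta\, X^T (y - p)$; the step size `eta` and iteration count are placeholder choices, and a column of ones is assumed to have been appended to `X` if an intercept term is desired.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.01, n_iters=1000):
    """Gradient ascent on the logistic regression log-likelihood,
    using the update w <- w + eta * X^T (y - p) from the notes."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # predicted P(y = 1 | x, w) for each row of X
        w = w + eta * X.T @ (y - p)  # gradient of the log-likelihood is X^T (y - p)
    return w
```

For comparison, the Newton-Raphson alternative mentioned above replaces the scalar step size with the inverse Hessian, giving the update $\hat{w} \leftarrow \hat{w} + (X^T R X)^{-1} X^T (y - p)$ with $R = \mathrm{diag}\big(p_i (1 - p_i)\big)$, at the cost of solving a linear system per iteration.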

References include Chapter 4 of Bishop, Chapter 4 of Hastie, Chapter 6 of Segaran, Horvitz et al. (1998), Lewis (1998), Graham (2002), and Metsis et al. (2006).