# Lectures 11 & 12: Regression

These lectures explored both the practice of and theory behind applied regression analysis. As stated by Cook & Weisberg, regression analysis aims “to understand, as far as possible with the available data, how the conditional distribution of the response varies across subpopulations determined by the possible values of the predictor(s).” This includes everything from simple descriptive statistics discussed in our earlier lectures to more sophisticated statistical models for situations in which we have a small number of observations relative to the number of subpopulations. As Berk explains in his constructive critique, the aim of such analyses may vary from providing a compact description of the data, to predicting future outcomes, to explaining associations between outcomes and predictors.

We discussed two case studies: *Predicting consumer behavior with Web search* and *A Large-Scale Study of Browsing Behavior*. The first explored the relative value of various data sources for predicting the performance of movies, music, and video games, using simple auto-regressive models to generate future sales predictions from current and past performance. The second study investigated variation in how different demographic groups use the Web, making use of linear models to smooth descriptive statistics over subpopulations and control for covariation between predictors (e.g., education and race).

In the following lecture we reviewed the formalism behind ordinary least squares regression, phrased as maximum likelihood inference under the assumption of additive Gaussian noise. We then derived the closed-form solution to the normal equations for small-scale problems and discussed alternative optimization methods, namely gradient descent and stochastic gradient descent, for larger-scale settings.

Ordinary least squares regression is, in principle, easily solved by inverting the normal equations:
$$\hat{w} = (X^T X)^{-1} X^T y.$$
In practice, however, it is often computationally expensive to perform the necessary matrix inversion for models with many features, even with specialized numerical methods.
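As a minimal sketch of the closed-form solution (on synthetic data, with illustrative variable names), the normal equations can be solved directly with NumPy; note that `np.linalg.solve` is preferred over forming the explicit inverse for numerical stability:

```python
import numpy as np

# Synthetic regression problem: 100 examples, 3 features,
# with additive Gaussian noise as in the model assumption.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Normal equations: w_hat = (X^T X)^{-1} X^T y,
# solved as a linear system rather than via an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to true_w, up to noise
```

With many features, forming and solving the d-by-d system `X.T @ X` is the expensive step, which motivates the iterative methods discussed next.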

Gradient descent offers an alternative solution to the normal equations, replacing potentially expensive matrix inversion with an iterative method where we update parameters by moving in the direction of steepest increase of the likelihood landscape:
$$\hat{w} \leftarrow \hat{w} + \eta X^T (y - X\hat{w}),$$
where $\eta$ is a tunable step size. Choosing $\eta$ too small leads to slow convergence, whereas too large a step size may result in undesirable oscillations about the optimum. Intuitively, gradient descent updates each component of the weight vector by a weighted sum of the corresponding feature values over all examples, where each example is weighted by the error between its actual and predicted labels. Stochastic gradient descent provides an even more scalable alternative in which an approximation to the gradient, computed from a subsample of the data, is used in place of the full-data gradient.
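A minimal sketch of both updates on synthetic data (step sizes and iteration counts here are illustrative, not tuned recommendations):

```python
import numpy as np

# Synthetic data with additive Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

# Batch gradient descent: w <- w + eta * X^T (y - X w).
eta = 0.001  # too small: slow convergence; too large: oscillation
w = np.zeros(3)
for _ in range(1000):
    w = w + eta * X.T @ (y - X @ w)

# Stochastic gradient descent: same update computed from a single
# randomly chosen example per step instead of the full dataset.
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(X))
    w_sgd = w_sgd + 0.01 * X[i] * (y[i] - X[i] @ w_sgd)
```

Both variants approach the least-squares solution; SGD trades some per-step accuracy for much cheaper updates, which is what makes it attractive at scale.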

See Chapter 3 of Bishop and Chapter 3 of Hastie for reference.

# Lecture 13: Classification

In this lecture we discussed classification methods for predicting discrete-valued targets (e.g., spam classification or gender identification). We noted several potential issues in directly applying linear regression to classification problems and explored naive Bayes and logistic regression as alternatives.

We first reviewed Bayes’ rule for inverting conditional probabilities via a simple, but perhaps counterintuitive, medical diagnosis example, and then adapted this into an (extremely naive) one-feature classifier. We improved upon this with naive Bayes, a simple linear method for classification in which each feature is modeled independently given the class. While this independence assumption is almost certainly violated in practice, naive Bayes often works well, and it is simple to train and cheap to apply at scale. It does, however, fail to account for correlations among features.
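As a minimal sketch of the idea (a Bernoulli naive Bayes classifier on a tiny synthetic spam example; the data and function names are illustrative):

```python
import numpy as np

# Toy data: rows are documents, columns are word-presence features.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

def fit_naive_bayes(X, y):
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # P(feature = 1 | class), modeled independently per feature,
    # with add-one (Laplace) smoothing to avoid zero probabilities.
    likelihoods = np.array([(X[y == c].sum(axis=0) + 1) /
                            ((y == c).sum() + 2) for c in classes])
    return classes, priors, likelihoods

def predict(x, classes, priors, likelihoods):
    # log P(c) + sum_j log P(x_j | c), maximized over classes.
    log_post = np.log(priors) + (x * np.log(likelihoods) +
                                 (1 - x) * np.log(1 - likelihoods)).sum(axis=1)
    return classes[np.argmax(log_post)]

classes, priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0]), classes, priors, likelihoods))  # → 1
```

Because both training and prediction reduce to counting and summing per-feature terms, the method scales easily; the cost is that correlations between features are ignored entirely.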

Logistic regression addresses this issue by directly modeling the conditional probability of the class given the features, using a logistic function to map the output of a linear model into the unit interval:
$$p(y=1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}}.$$
While maximum likelihood inference for logistic regression does not admit a closed-form solution, gradient descent yields an update similar to that for linear regression:
$$\hat{w} \leftarrow \hat{w} + \eta X^T (y - p),$$
where $p$ is the vector of predicted probabilities. In smaller-scale settings one can improve on these updates with second-order methods such as Newton-Raphson, which leverage the local curvature of the likelihood landscape to determine the step size at each iteration. As with regression, some form of regularization is often useful for balancing fit to the training data against generalization error when the number of features is relatively large.
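The gradient update above can be sketched as follows on synthetic data (the step size, iteration count, and variable names are illustrative assumptions):

```python
import numpy as np

# Synthetic binary classification data: labels drawn from the
# logistic model with a known weight vector.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
true_w = np.array([3.0, -2.0])
p_true = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=500) < p_true).astype(float)

# Gradient update: w <- w + eta * X^T (y - p), with p = sigmoid(X w).
w = np.zeros(2)
eta = 0.005
for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ w)))
    w = w + eta * X.T @ (y - p)

p = 1 / (1 + np.exp(-(X @ w)))
accuracy = ((p > 0.5) == (y == 1)).mean()
print(accuracy)  # training accuracy of the fitted model
```

Note that the update has exactly the same form as the linear-regression case, with the model's predicted probabilities `p` in place of the linear predictions.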

References include Chapter 4 of Bishop; Chapter 4 of Hastie; Chapter 6 of Segaran; Horvitz et al., 1998; Lewis, 1998; Graham, 2002; and Metsis et al., 2006.