These lectures explored both the practice of and theory behind applied regression analysis. As stated by Cook & Weisberg, regression analysis aims “to understand, as far as possible with the available data, how the conditional distribution of the response varies across subpopulations determined by the possible values of the predictor(s).” This includes everything from simple descriptive statistics discussed in our earlier lectures to more sophisticated statistical models for situations in which we have a small number of observations relative to the number of subpopulations. As Berk explains in his constructive critique, the aim of such analyses may vary from providing a compact description of the data, to predicting future outcomes, to explaining associations between outcomes and predictors.

We discussed two case studies: Predicting consumer behavior with Web search and A Large-Scale Study of Browsing Behavior. The first explored the relative value of various data sources for predicting the performance of movies, music, and video games, using simple auto-regressive models to generate future sales predictions from current and past performance. The second study investigated variation in how different demographic groups use the Web, making use of linear models to smooth descriptive statistics over subpopulations and control for covariation between predictors (e.g., education and race).

In the following lecture we reviewed the formalism behind ordinary least squares regression, phrased as maximum likelihood inference under the assumption of additive Gaussian noise. We then derived the closed-form solution to the normal equations for small-scale problems and discussed alternative optimization methods, namely gradient descent and stochastic gradient descent, for larger-scale settings.

Ordinary least squares regression is, in principle, easily solved by inverting the normal equations:

$$ \hat{w} = (X^T X)^{-1} X^T y.$$

In practice, however, the necessary matrix inversion is often computationally expensive for models with many features, even with specialized numerical methods.
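As a minimal sketch of the closed-form approach, the normal equations can be solved directly rather than by explicitly inverting $X^T X$, which is cheaper and numerically more stable; the synthetic data below (generator seed, true weights, and noise level) is assumed for illustration.

```python
import numpy as np

# Hypothetical synthetic data: y = X w_true + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Solve the normal equations (X^T X) w = X^T y as a linear system,
# avoiding an explicit matrix inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)
```

Solving the linear system costs roughly a third of a full inversion and avoids the extra round-off error an explicit inverse introduces, though it is still cubic in the number of features.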

Gradient descent offers an alternative to solving the normal equations, replacing potentially expensive matrix inversion with an iterative method in which we update the parameters by moving in the direction of steepest ascent of the likelihood:

$$ \hat{w} \leftarrow \hat{w} + \eta X^T (y - X\hat{w}), $$

where $\eta$ is a tunable step size. Choosing $\eta$ too small leads to slow convergence, whereas too large a step size may result in undesirable oscillations about the optimum. Intuitively, gradient descent updates each component of the weight vector by a sum of the corresponding feature values over all examples, where each example is weighted by the error between its actual and predicted labels. Stochastic gradient descent provides an even more scalable alternative in which an approximation to the gradient computed from a subsample of the data is used in place of the full-data gradient.
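The batch update rule above and its stochastic variant can be sketched as follows; the synthetic data, step sizes, iteration counts, and minibatch size are all illustrative assumptions, not prescribed values.

```python
import numpy as np

# Hypothetical synthetic data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -0.5, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Batch gradient descent: w <- w + eta * X^T (y - X w).
eta = 0.001          # too small: slow convergence; too large: oscillation
w_gd = np.zeros(3)
for _ in range(1000):
    w_gd = w_gd + eta * X.T @ (y - X @ w_gd)

# Stochastic (minibatch) variant: approximate the gradient
# from a random subsample instead of the full dataset.
eta_sgd = 0.01
w_sgd = np.zeros(3)
for _ in range(2000):
    idx = rng.choice(200, size=20, replace=False)
    Xb, yb = X[idx], y[idx]
    w_sgd = w_sgd + eta_sgd * Xb.T @ (yb - Xb @ w_sgd)

print(w_gd, w_sgd)
```

With a constant step size the stochastic iterates fluctuate around the optimum rather than converging exactly; decaying $\eta$ over iterations is the usual remedy.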

See Chapter 3 of Bishop and Chapter 3 of Hastie for reference.