This was the first of two lectures on the theory and practice of regression.

We started with a high-level overview of regression, which can be broadly defined as any analysis of how one continuous variable (the “outcome”) changes with others (the “inputs”, “predictors”, or “features”). The goals of a regression analysis can vary, from describing the data at hand, to predicting new outcomes, to explaining the associations between outcomes and predictors. This includes everything from looking at histograms and scatter plots to building statistical models.

We focused on the latter and discussed ordinary least squares (OLS) regression. We first motivated OLS as an optimization problem and then connected squared-loss minimization to the more general principle of maximum likelihood. We then discussed several ways to solve this optimization problem and estimate the coefficients of a linear model, which are summarized in the table below.
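One way to make the squared-loss/maximum-likelihood connection explicit (under the standard Gaussian-noise model, not a choice unique to this lecture): assume each outcome is linear in its inputs plus i.i.d. Gaussian noise,

$$y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2).$$

The log-likelihood of the coefficients is then

$$\log L(\beta) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2,$$

so maximizing the likelihood over $\beta$ is equivalent to minimizing the sum of squared errors, which is exactly the OLS objective.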

| Method | Memory | Time | Notes |
| --- | --- | --- | --- |
| Invert normal equations | $NK + K^2$ | $K^3$ | Good for medium-sized datasets with a relatively small number (e.g., hundreds or thousands) of features |
| Gradient descent | $NK$ | $NK$ per step | Good for larger datasets that still fit in memory but have more (e.g., millions of) features; requires tuning the learning rate |
| Stochastic gradient descent | $K$ | $K$ per step | Good for datasets that exceed available memory; more sensitive to the learning rate schedule |
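The three solvers in the table can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data, not code from the lecture; the step counts and learning rates are assumptions chosen to make the small example converge.

```python
import numpy as np

# Synthetic regression problem: X is (N, K), y is (N,).
rng = np.random.default_rng(0)
N, K = 1000, 5
X = rng.normal(size=(N, K))
beta_true = rng.normal(size=K)
y = X @ beta_true + 0.1 * rng.normal(size=N)

# 1. Normal equations: solve (X^T X) beta = X^T y directly.
#    Forming X^T X costs O(N K^2); solving the K-by-K system costs O(K^3).
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Gradient descent: full-data gradient of the mean squared error,
#    O(NK) work per step. Learning rate and step count are hand-tuned here.
def gradient_descent(X, y, lr=0.01, steps=2000):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ beta - y) / len(y)  # gradient of the MSE
        beta -= lr * grad
    return beta

# 3. Stochastic gradient descent: one randomly chosen row per step,
#    O(K) work per step; noisier, so it needs more steps.
def sgd(X, y, lr=0.01, steps=20000, seed=1):
    step_rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        i = step_rng.integers(len(y))
        grad = (X[i] @ beta - y[i]) * X[i]  # single-example gradient
        beta -= lr * grad
    return beta

beta_gd = gradient_descent(X, y)
beta_sgd = sgd(X, y)
```

On this well-conditioned toy problem all three land on (or very near) the same coefficients; with a constant learning rate, SGD hovers in a small neighborhood of the optimum rather than converging exactly, which is why a decaying learning rate schedule matters in practice.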