Ordinary Least Squares (OLS)

#optimization_algorithm #normal_equation

Ordinary Least Squares (OLS) is an analytical method used to compute the $\theta$ parameters of the hypothesis function.

According to the Gauss–Markov theorem, a set of assumptions must be met in order to guarantee the validity of OLS for estimating the coefficients of a regression:

- the model is linear in its parameters
- the errors have zero mean (exogeneity)
- the errors are homoscedastic (constant variance) and uncorrelated with each other
- the predictors are not perfectly collinear

Overlooking these assumptions can lead to incorrect results.

⚠️ This method can use feature scaling, but it doesn't need it.

Univariate OLS Linear Regression

Given the hypothesis function of a univariate linear regression:

$$h_\theta(x) = \theta_0 + \theta_1 x + \varepsilon$$

One can compute the unbiased estimators $\hat{\theta}_0$ and $\hat{\theta}_1$ of $\theta_0$ and $\theta_1$:

$$\hat{\theta}_1 = \frac{S_{x,y}}{S_x^2} \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1\,\bar{x}$$

with:
- $S_{x,y}$ = the covariance of the training inputs $X$ and outputs $Y$
- $S_x^2$ = the variance of the training inputs $X$
- $\bar{x}$ = the mean of the training inputs $X$
- $\bar{y}$ = the mean of the training outputs $Y$

Then y can be estimated (predicted) with:

$$\hat{y} = \hat{\theta}_0 + \hat{\theta}_1\,x$$

with:
- $x$ = the input for which one wants to predict the output $y$
- $\hat{\theta}_i$ = the computed estimator of $\theta_i$

And that's done!
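
As a minimal sketch of these closed-form estimators with NumPy (the data values and variable names below are illustrative assumptions, not from this note):

```python
import numpy as np

# Illustrative training data (assumed for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form univariate OLS estimators:
# theta_1 = S_xy / S_x^2, theta_0 = y_bar - theta_1 * x_bar
s_xy = np.cov(x, y, ddof=1)[0, 1]   # covariance of inputs and outputs
s_x2 = np.var(x, ddof=1)            # variance of the inputs
theta_1 = s_xy / s_x2
theta_0 = y.mean() - theta_1 * x.mean()

# Predict the output for a new input
x_new = 6.0
y_hat = theta_0 + theta_1 * x_new
```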


But various metrics can also be computed to evaluate the model.

The unexplained error of each training example:

$$e_i = y_i - \hat{y}_i \qquad (e_i^2 \text{ is called the residual error})$$

with:
- $y_i$ = the expected output of the $i^{th}$ sample
- $\hat{y}_i$ = the predicted output of the $i^{th}$ sample

The sum of residuals ε (which is the OLS cost function):

$$\varepsilon = \sum_{i=1}^{m} e_i^2 = \sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{m}\left(y_i - (\hat{\theta}_0 + \hat{\theta}_1\,x_i)\right)^2$$

with:
- $m$ = the number of samples
- $e_i^2$ = the residual for the error $e_i$ of the $i^{th}$ sample

The unbiased residual variance (also called unexplained variance or error variance), where the denominator $m - 1 - 1$ accounts for the one predictor plus the intercept:

$$\hat{\sigma}^2 = \frac{1}{m - 1 - 1}\,\sum_{i=1}^{m} e_i^2$$

with:
- $m$ = the number of samples
- $e_i^2$ = the residual for the error $e_i$ of the $i^{th}$ sample
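
A short sketch of these metrics, assuming the illustrative `x`, `y`, `theta_0` and `theta_1` from the univariate sketch above:

```python
# Residuals of the training examples
y_pred = theta_0 + theta_1 * x
e = y - y_pred                      # errors e_i
sse = np.sum(e ** 2)                # sum of squared residuals (the OLS cost)

# Unbiased residual variance: divide by m - 1 - 1 (one predictor + intercept)
m = len(x)
sigma2_hat = sse / (m - 1 - 1)
```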

Vectorized Univariate OLS Linear Regression

There exist vectorized formulas for batch predictions:

$$h_\theta(x) = \theta_0 \begin{bmatrix}1\\ \vdots\\ 1\end{bmatrix} + \theta_1 \begin{bmatrix}x_1\\ \vdots\\ x_m\end{bmatrix} + \begin{bmatrix}e_1\\ \vdots\\ e_m\end{bmatrix} \quad\Longleftrightarrow\quad Y = \theta_0\,\mathbf{1}_m + \theta_1\,X + \varepsilon$$

$$\varepsilon = \lVert Y - (\theta_0\,\mathbf{1}_m + \theta_1\,X) \rVert_{\ell_2}^2$$

with:
- $\lVert \cdot \rVert_{\ell_2}$ = the Euclidean norm
- $Y$ = the output values
- $X$ = the input values
- $\mathbf{1}_m$ = an array filled with 1, of size $m$
- $m$ = the number of samples
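
A quick sketch of this batch formulation, again reusing the illustrative univariate variables from above:

```python
# Batch prediction: Y_hat = theta_0 * 1_m + theta_1 * X
ones_m = np.ones(len(x))
y_hat_batch = theta_0 * ones_m + theta_1 * x

# OLS cost as the squared Euclidean norm of the error vector
cost = np.linalg.norm(y - y_hat_batch) ** 2
```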

Although this approach is very simple and works very well with univariate linear regressions, this is not the case with the multivariate version.

Multivariate OLS Linear Regression

Given the hypothesis function of a multivariate linear regression with $p$ predictors (or features):

$$h_\theta(x) = \theta_0\,x_0 + \theta_1\,x_1 + \dots + \theta_p\,x_p + \varepsilon \qquad \text{(with } x_0 = 1 \text{ by convention, for the intercept)}$$

One can compute the unbiased estimators $\hat{\theta}_0 \dots \hat{\theta}_p$ of $\theta_0 \dots \theta_p$ using the normal equation:

$$\hat{\theta} = (X^T X)^{-1} X^T Y$$

with:
- $X$ = a matrix of i.i.d. training examples arranged as rows
- $Y$ = a vector of all the expected output values

⚠️ $X^T X$ must be invertible.
Otherwise a pseudo-inverse matrix can be used instead.
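
A minimal NumPy sketch of the normal equation (the data is illustrative; `X_b` prepends a column of ones for the intercept, and `np.linalg.pinv` is used so the computation also works when $X^T X$ is not invertible):

```python
import numpy as np

# Illustrative training data: m = 4 samples, p = 2 features (assumed for this sketch)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
Y = np.array([5.0, 4.0, 7.5, 11.0])

# Prepend a column of ones so theta_hat[0] is the intercept
X_b = np.c_[np.ones(X.shape[0]), X]

# Normal equation: theta_hat = (X^T X)^{-1} X^T Y
# pinv is used instead of inv to tolerate a non-invertible X^T X
theta_hat = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ Y
```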

Then y for a single new input x can be estimated (predicted) with:

$$\hat{y}_i = \hat{\theta}_0\,x_{i,0} + \hat{\theta}_1\,x_{i,1} + \dots + \hat{\theta}_p\,x_{i,p}$$

with:
- $x_i$ = the input for which one wants to predict the output $y$
- $\hat{\theta}$ = a vector of all the estimators computed with the normal equation
- $p$ = the number of predictors / features

Or a batch of predictions can be made using the vectorized version:

$$\hat{Y} = X\,\hat{\theta} \quad\Longleftrightarrow\quad \begin{bmatrix}\hat{y}_1\\ \vdots\\ \hat{y}_m\end{bmatrix} = \begin{bmatrix}x_{1,0} & \dots & x_{1,p}\\ \vdots & & \vdots\\ x_{m,0} & \dots & x_{m,p}\end{bmatrix} \begin{bmatrix}\hat{\theta}_0\\ \vdots\\ \hat{\theta}_p\end{bmatrix}$$

with:
- $X$ = a matrix of i.i.d. training examples (features) arranged as rows
- $\hat{\theta}$ = a vector of all the estimators computed with the normal equation
- $p$ = the number of predictors / features
- $m$ = the number of samples
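
As a sketch, the batch prediction is a single matrix product (reusing the illustrative `X_b` and `theta_hat` from above):

```python
# Batch predictions for all training samples (or for any new design matrix)
Y_hat = X_b @ theta_hat
```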

And that's done!


And similarly to the univariate linear regression, various metrics can be computed to evaluate the model (e.g. statistical tests).

The unexplained error of each training example:

$$e_i = y_i - \hat{y}_i \qquad e = Y - \hat{Y} \qquad (e_i^2 \text{ is called the residual error})$$

with:
- $y_i$ = the expected output of the $i^{th}$ sample
- $\hat{y}_i$ = the predicted output of the $i^{th}$ sample

The sum of residuals ε (which is the OLS cost function):

$$\varepsilon = \sum_{i=1}^{m} e_i^2 = \sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{m}\left(y_i - (\hat{\theta}_0\,x_{i,0} + \hat{\theta}_1\,x_{i,1} + \dots + \hat{\theta}_p\,x_{i,p})\right)^2 = \sum_{i=1}^{m}\left(y_i - \sum_{j=0}^{p}\hat{\theta}_j\,x_{i,j}\right)^2$$

with:
- $m$ = the number of samples
- $p$ = the number of predictors / features
- $e_i^2$ = the residual for the error $e_i$ of the $i^{th}$ sample

The unbiased residual variance (also called unexplained variance or error variance):

$$\hat{\sigma}^2 = \frac{1}{m - p - 1}\,\sum_{i=1}^{m} e_i^2$$

with:
- $m$ = the number of samples
- $p$ = the number of predictors / features
- $e_i^2$ = the residual for the error $e_i$ of the $i^{th}$ sample
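
A short sketch of these metrics, assuming the illustrative `X`, `Y` and `Y_hat` from the multivariate sketch above:

```python
# Errors and sum of squared residuals
e = Y - Y_hat
sse = np.sum(e ** 2)

# Unbiased residual variance: divide by m - p - 1
m, p = X.shape          # number of samples and number of predictors
sigma2_hat = sse / (m - p - 1)
```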

Normal Equation complexity

One disadvantage of this approach is that computing the inverse $(X^T X)^{-1}$ has a complexity of $O(n^3)$, where $n$ is the number of features.

So if there is a very large number of features (e.g. more than 10000), it will be slow and it might be a good time to use an iterative process such as Gradient Descent.
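
For comparison, a minimal batch Gradient Descent sketch that minimizes the same cost iteratively instead of inverting $X^T X$ (the learning rate and iteration count are arbitrary assumptions, and `X_b`, `Y` are the illustrative variables from the earlier sketch):

```python
# Illustrative batch gradient descent on the OLS cost
alpha, n_iters = 0.01, 5000
theta_gd = np.zeros(X_b.shape[1])

for _ in range(n_iters):
    # Gradient of (1/m) * ||X_b @ theta - Y||^2
    gradient = (2 / X_b.shape[0]) * X_b.T @ (X_b @ theta_gd - Y)
    theta_gd -= alpha * gradient
```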

Normal Equation Non-invertibility

In some cases, $X^T X$ may be non-invertible, and the common causes are either:

- redundant features (some features are linearly dependent on others)
- too many features relative to the number of samples (e.g. $m \le n$)

Machine learning libraries tend to offer a way to protect against this problem. For instance, in NumPy or Octave the `pinv()` function can be used instead of the `inv()` function.
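
For instance, a small assumed NumPy example where the columns are linearly dependent, so that $X^T X$ is singular:

```python
import numpy as np

# Second column is twice the first, so X_sing^T X_sing is singular
X_sing = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 6.0]])
Y_sing = np.array([1.0, 2.0, 3.0])

# np.linalg.inv(X_sing.T @ X_sing) would raise numpy.linalg.LinAlgError here;
# the pseudo-inverse still returns a (minimum-norm) solution
theta_hat = np.linalg.pinv(X_sing.T @ X_sing) @ X_sing.T @ Y_sing
```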

(Bonus) Deciphering the Normal Equation

Given the vectorized hypothesis of a #Multivariate OLS Linear Regression:

$$h_\theta(x) = \theta^T x \quad\Longrightarrow\quad \hat{Y} = X\,\hat{\theta} \quad\Longrightarrow\quad X\,\theta \approx Y$$

One way to find θ is to get rid of X by multiplying it by its inverse (if X is square) so that it turns into an Identity matrix.

$$X^{-1} X\,\theta \approx X^{-1} Y \quad\Longrightarrow\quad I\,\theta \approx X^{-1} Y \quad\Longrightarrow\quad \theta \approx X^{-1} Y \qquad \text{(if } X \text{ is square and invertible)}$$

with $(X^{-1} X) = I$, an identity matrix

Fortunately, the Gram matrix $X^T X$ is always square... so one can multiply both sides by $X^T$ to form the Gram matrix, then by its inverse $(X^T X)^{-1}$ to obtain an identity matrix and isolate $\theta$.

$$(X^T X)^{-1}(X^T X)\,\theta \approx (X^T X)^{-1}(X^T Y) \quad\Longrightarrow\quad I\,\theta \approx (X^T X)^{-1}(X^T Y)$$

with:
- $(X^T X)$ = the Gram matrix
- $(X^T X)^{-1}(X^T X) = I$, the identity matrix

So, if the Gram matrix is invertible, the Normal equation finally appears:

$$\theta \approx (X^T X)^{-1}(X^T Y)$$
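
As an illustrative sanity check (assumed synthetic data): generating noiseless outputs from known coefficients, the normal equation recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]   # intercept column + 3 features
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ theta_true                                  # noiseless outputs

theta_hat = np.linalg.pinv(X.T @ X) @ X.T @ Y
print(np.allclose(theta_hat, theta_true))           # True (up to float error)
```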