# Linear Regression

## 1. Model Representation / Hypothesis function

The purpose of the model ${h}_{\theta }$ is to predict output values ${h}_{\theta }\left(x\right)$ from unseen input values $x$ (the features), using knowledge $\theta$ extracted from a set of data for which the output values $y$ are already known.

### Univariate Linear Regression model

The hypothesis function of a Univariate Linear Regression maps a single input value (the $x$'s) to a single quantitative output value (the $y$'s, estimated by ${h}_{\theta }\left(x\right)$), and it has the following form:

$${h}_{\theta }\left(x\right)={\theta }_{0}+{\theta }_{1}x$$

$\epsilon$ is the error made by the model. It accounts for the fact that the model won't perfectly retrieve the original $y$ values: ${y}_{i}={h}_{\theta }\left({x}_{i}\right)+{\epsilon }_{i}$, so that ${h}_{\theta }\left({x}_{i}\right)\approx {y}_{i}$.

The vector of errors $\epsilon$ is assumed to follow a normal distribution with a mean equal to 0 and an unknown variance (or covariance matrix in a multivariate model).
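As a minimal sketch of the univariate model and its errors, the snippet below evaluates ${h}_{\theta }\left(x\right)={\theta }_{0}+{\theta }_{1}x$ on a few points and computes the residuals ${\epsilon }_{i}={y}_{i}-{h}_{\theta }\left({x}_{i}\right)$. The data and the parameter values are made up for illustration, and the function name `hypothesis` is an assumption, not something defined in the text.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical toy data that roughly follows y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

# Assumed parameter values (chosen by hand, not fitted).
theta0, theta1 = 1.0, 2.0

# Model outputs h_theta(x_i) for every input.
predictions = [hypothesis(theta0, theta1, x) for x in xs]

# Errors made by the model: epsilon_i = y_i - h_theta(x_i).
errors = [y - p for y, p in zip(ys, predictions)]
```

The errors are small but non-zero, which is the sense in which ${h}_{\theta }\left({x}_{i}\right)\approx {y}_{i}$.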

### Multivariate Linear Regression model

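The hypothesis function of a Multivariate Linear Regression maps multiple input values (the $x$'s, here $n$ features ${x}_{1},\dots ,{x}_{n}$) to a single quantitative output value (the $y$'s, estimated by ${h}_{\theta }\left(x\right)$), and it has the following form:

$${h}_{\theta }\left(x\right)={\theta }_{0}+{\theta }_{1}{x}_{1}+{\theta }_{2}{x}_{2}+\dots +{\theta }_{n}{x}_{n}$$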

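Using the matrix multiplication representation, with $\theta ={\left[{\theta }_{0},{\theta }_{1},\dots ,{\theta }_{n}\right]}^{T}$ and $x={\left[{x}_{0},{x}_{1},\dots ,{x}_{n}\right]}^{T}$ where ${x}_{0}=1$ by convention, this multi-variable hypothesis function can be concisely represented as:

$${h}_{\theta }\left(x\right)={\theta }^{T}x$$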
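The vectorized form can be sketched in plain Python as a dot product, assuming the usual convention that a constant feature ${x}_{0}=1$ is prepended so that ${\theta }_{0}$ acts as the intercept. The function name and the numbers are hypothetical.

```python
def hypothesis(theta, features):
    """Multivariate linear hypothesis h_theta(x) = theta^T x.

    `theta` holds [theta_0, theta_1, ..., theta_n]; `features` holds
    [x_1, ..., x_n]. The constant x_0 = 1 is prepended here.
    """
    x = [1.0] + list(features)
    if len(x) != len(theta):
        raise ValueError("theta must have one more entry than features")
    # Dot product of the parameter vector and the feature vector.
    return sum(t * xi for t, xi in zip(theta, x))


# Assumed parameters for a model with two features.
theta = [1.0, 2.0, -0.5]

# 1.0 * 1 + 2.0 * 3.0 + (-0.5) * 4.0
prediction = hypothesis(theta, [3.0, 4.0])
```

With a numerical library the same computation is typically a single matrix-vector product over the whole dataset.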

### Optimizing the parameters

To build a model, one must find the $\theta$ parameters. This is done using a cost function along with an optimization method.

## 2. Cost function

A cost function (or objective function) is a mathematical criterion used to measure how well the hypothesis function maps training examples to the correct outputs.

In the best case, the line passes through all the points of the dataset and the cost (measured on the cross-validation set while selecting the model, then on the test set for final verification) is equal to 0.

Cost functions for regression problems include, among others, the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE).

In practice, the most common cost functions are the Mean Squared Error in production and the Root Mean Squared Error in analysis, the latter being expressed in the same units as $y$ and therefore easier to interpret.
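The two cost functions named above can be sketched as follows. This version assumes the plain $1/m$ average over the $m$ examples (some courses use $1/2m$ instead, which does not change the minimizing $\theta$); the function names and the data are illustrative.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    m = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / m


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as y."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical targets and predictions: only the last point is off (by 2).
y_true = [1.0, 3.0, 5.0, 7.0]
y_pred = [1.0, 3.0, 5.0, 9.0]

mse = mean_squared_error(y_true, y_pred)        # (0 + 0 + 0 + 4) / 4
rmse = root_mean_squared_error(y_true, y_pred)

# A model that passes through every point has a cost of 0.
perfect = mean_squared_error(y_true, y_true)
```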

The likelihood, as used in maximum likelihood estimation, is another kind of objective function that is often used in practice. When such an objective function is used, the goal is to maximize it rather than minimize it.
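To see how this relates to the squared-error costs, here is a short derivation under stated assumptions: the errors ${\epsilon }_{i}={y}_{i}-{h}_{\theta }\left({x}_{i}\right)$ are independent and normally distributed with mean 0 and a common variance ${\sigma }^{2}$, and $m$ denotes the number of training examples (both $m$ and $\sigma$ are symbols introduced here, not taken from the text). The log-likelihood is then:

$$
\log L\left(\theta \right)=\sum _{i=1}^{m}\log \left[\frac{1}{\sqrt{2\pi {\sigma }^{2}}}\exp \left(-\frac{{\left({y}_{i}-{h}_{\theta }\left({x}_{i}\right)\right)}^{2}}{2{\sigma }^{2}}\right)\right]=-\frac{m}{2}\log \left(2\pi {\sigma }^{2}\right)-\frac{1}{2{\sigma }^{2}}\sum _{i=1}^{m}{\left({y}_{i}-{h}_{\theta }\left({x}_{i}\right)\right)}^{2}
$$

The first term does not depend on $\theta$, so under these assumptions maximizing $\log L\left(\theta \right)$ gives the same $\theta$ as minimizing the sum of squared errors, and therefore the same $\theta$ as minimizing the Mean Squared Error.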

## 3. Optimization methods

To build the model ${h}_{\theta }$ and use it to predict output values ${h}_{\theta }\left(x\right)$ from unseen input values $x$, one must find the $\theta$ parameters.

This process of finding the parameter values for which the cost (the aggregate error $\epsilon$ made by the model) reaches its minimum is called optimization.

When it comes to solving such problems, there exist both analytical and numerical approaches.

### The analytical approach

An analytical solution involves framing the problem in a well-understood form and calculating the exact solution without using an iterative method.

In the case of a linear regression, the method that is usually used is Ordinary Least Squares (OLS), which computes the $\theta$ parameters directly with closed-form formulas. It can however be very slow on large problems, because it requires solving a linear system whose size grows with the number of features.
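For the univariate case, the closed-form OLS solution reduces to two short formulas: ${\theta }_{1}$ is the covariance of $x$ and $y$ divided by the variance of $x$, and ${\theta }_{0}=\overline{y}-{\theta }_{1}\overline{x}$. The sketch below implements exactly that; the function name `ols_fit` and the data are assumptions made for illustration. In the multivariate case the equivalent is the normal equation $\theta ={\left({X}^{T}X\right)}^{-1}{X}^{T}y$, usually delegated to a linear algebra library.

```python
def ols_fit(xs, ys):
    """Closed-form least-squares fit of y = theta0 + theta1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = ols_fit(xs, ys)
```

No iteration and no tuning parameter are involved: the parameters come out of a single pass over the data.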

### The numerical approach

A numerical solution means making guesses at the solution and testing whether the problem is solved well enough to stop iterating.

In the case of a linear regression, the method that is usually used is Gradient Descent, which repeatedly updates the $\theta$ parameters in the direction that decreases the cost, until the error $\epsilon$ made by the model stops improving.

This technique is a little excessive for univariate regression (as OLS is simpler), but it's definitely useful for multivariate regression.
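A minimal sketch of batch gradient descent for the univariate model is given below. It assumes a squared-error cost scaled by $1/2m$, a fixed learning rate $\alpha$ (the step size), and a stopping rule based on how little the cost changes between two iterations. The function name, default values, and data are all illustrative choices, not prescriptions.

```python
def gradient_descent(xs, ys, alpha=0.1, n_iters=5000, tol=1e-12):
    """Fit y = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0          # initial guess
    prev_cost = float("inf")
    for _ in range(n_iters):
        # Signed errors h_theta(x_i) - y_i for the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / (2 * m)
        # Stop once the cost no longer improves noticeably.
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
        # Partial derivatives of the cost with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update, one step against the gradient.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
```

On this data the result approaches ${\theta }_{0}=1$ and ${\theta }_{1}=2$ only approximately, and a learning rate that is too large would make the iterations diverge, which is why $\alpha$ has to be chosen with care.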

There exist several algorithms to optimize the $\theta$ parameters that are more sophisticated than Gradient Descent.

Commonly used optimization algorithms of this kind include Conjugate Gradient, BFGS and L-BFGS.

These optimization algorithms are usually faster, and they remove the hassle of manually picking the learning rate $\alpha$. In return, they are more complex to implement, and it is highly recommended to use the pre-written functions available in most machine learning libraries.

### OLS (Normal Equation) vs Gradient Descent

| OLS (Normal Equation) | Gradient Descent |
| --- | --- |
| No need to choose $\alpha$ | Need to choose $\alpha$ |
| Slow if $n$ is larger than 10000 | Works well when $n$ is large |