Mean Squared Error (MSE)

#loss_function #cost_function

Approximation

$$J(\theta) = \mathbb{E}\left[ (Y - \hat{Y})^2 \right]$$

with:
- $Y$ = the ground truth output values for the training examples
- $\hat{Y}$ = the predicted output values for the training examples
- $\mathbb{E}[z]$ = the mean estimator: $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$
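As a quick sanity check, here is a minimal NumPy sketch that estimates this expectation as the sample mean of the squared errors (the names `mse`, `y_true`, and `y_pred` are illustrative, not from the note):

```python
import numpy as np

def mse(y_true, y_pred):
    """Estimate E[(Y - Y_hat)^2] as the sample mean of squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Example: ground truth vs. predictions for m = 4 training examples
print(mse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # 0.375
```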

Expanded

The MSE expression is often divided by 2 to simplify the derivative calculations (the 2 produced by differentiating the square cancels it) and hence speed up gradient descent.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - h_\theta(x_i) \right)^2$$

with:
- $m$ = the number of training examples
- $x_i$ = the input (features) of the $i$-th training example
- $y_i$ = the ground truth output of the $i$-th training example
- $h_\theta(x_i)$ or $\hat{y}_i$ = the predicted output of the $i$-th training example
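A direct, loop-based sketch of this expanded form, assuming a linear hypothesis $h_\theta(x) = \theta^T x$ (the note doesn't pin the hypothesis down; the function name `cost_expanded` is illustrative):

```python
import numpy as np

def cost_expanded(theta, xs, ys):
    """J(theta) = 1/(2m) * sum_i (y_i - h_theta(x_i))^2, summed example by example."""
    m = len(ys)
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        h_i = np.dot(theta, x_i)  # linear hypothesis: h_theta(x_i) = theta . x_i
        total += (y_i - h_i) ** 2
    return total / (2 * m)

theta = np.array([1.0, 2.0])
xs = [np.array([1.0, 2.0]), np.array([1.0, 3.0])]
ys = [5.0, 8.0]
print(cost_expanded(theta, xs, ys))  # ((5-5)^2 + (8-7)^2) / (2*2) = 0.25
```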

The $\frac{1}{2m}$ factor isn't required, but it turns the cost function into a good approximation of the generalization error for a randomly chosen new sample (one not in the training set).

Whether or not this factor is added, the final result is the same: the minimization / optimization process is unaffected by positive constant factors, so the $\theta$ that minimizes $J(\theta)$ also minimizes $c \cdot J(\theta)$ for any $c > 0$.
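To make the cancellation concrete, here is the usual gradient derivation for a linear hypothesis $h_\theta(x) = \theta^T x$ (a standard result spelled out here, not taken from the note): the 2 produced by differentiating the square cancels the $\frac{1}{2}$ exactly.

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x_i) - y_i \right) \frac{\partial h_\theta(x_i)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{i,j}$$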

Vectorized

$$J(\theta) = \frac{1}{2m} \left( X\theta - y \right)^T \left( X\theta - y \right)$$

with:
- $X$ = the design matrix, with the training examples arranged as rows
- $y$ = a vector of all the expected output values
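A vectorized NumPy sketch of this form, under the same linear-hypothesis assumption as above (`cost_vectorized` is an illustrative name; $X$ is assumed to include any bias column the model uses):

```python
import numpy as np

def cost_vectorized(theta, X, y):
    """J(theta) = 1/(2m) * (X @ theta - y)^T (X @ theta - y)."""
    m = len(y)
    residual = X @ theta - y  # vector of per-example errors
    return (residual @ residual) / (2 * m)

# Should agree with the loop-based version above
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])  # rows = training examples
y = np.array([5.0, 7.0, 11.0])
theta = np.array([1.0, 2.0])
print(cost_vectorized(theta, X, y))  # 0.0 -- this theta fits the data exactly
```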