#loss_function #cost_function

Approximation

$$J(\theta) = E\left[ (Y - \hat{Y})^2 \right]$$

with
$Y$ = the ground truth output values for the training examples
$\hat{Y}$ = the predicted output values for the training examples
$E[z]$ = the mean estimator: $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$

Expanded

The MSE expression is often divided by 2 so that this factor cancels the 2 produced by differentiating the square, which simplifies the gradient expression used in gradient descent.
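As a quick check of that cancellation, here is the derivative of a single halved squared-error term, for one training example with ground truth $y_i$ and prediction $h_\theta(x_i)$. The parameter index $j$ is notation introduced here for illustration; the note itself does not name individual parameters.

$$\frac{\partial}{\partial \theta_j} \, \frac{1}{2} \bigl( y_i - h_\theta(x_i) \bigr)^2 = \frac{1}{2} \cdot 2 \, \bigl( y_i - h_\theta(x_i) \bigr) \cdot \left( -\frac{\partial h_\theta(x_i)}{\partial \theta_j} \right) = -\bigl( y_i - h_\theta(x_i) \bigr) \, \frac{\partial h_\theta(x_i)}{\partial \theta_j}$$

Without the $\frac{1}{2}$, the same result would carry an extra factor of 2.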

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \frac{1}{2m} \sum_{i=1}^{m} \bigl( y_i - h_\theta(x_i) \bigr)^2$$

with
$m$ = the number of training examples
$x_i$ = the input (feature) of the $i$th training example
$y_i$ = the ground truth output of the $i$th training example
$h_\theta(x_i)$ or $\hat{y}_i$ = the predicted output of the $i$th training example
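A minimal Python sketch of the expanded sum, to make the indexing concrete. The function name `mse_cost`, the toy data, and the linear hypothesis `theta0 + theta1 * x` are all assumptions made for illustration; the note does not fix a particular form for $h_\theta$.

```python
from typing import Callable, Sequence


def mse_cost(
    xs: Sequence[float],
    ys: Sequence[float],
    h: Callable[[float], float],
) -> float:
    """Return J = 1/(2m) * sum_i (y_i - h(x_i))^2."""
    m = len(xs)
    if m == 0 or m != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    # Sum the squared residual of each training example, then scale by 1/(2m).
    squared_errors = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    return squared_errors / (2 * m)


# Hypothetical toy data and a hypothetical linear hypothesis h(x) = theta0 + theta1 * x.
theta0, theta1 = 0.0, 2.0
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
cost = mse_cost(xs, ys, lambda x: theta0 + theta1 * x)
print(cost)
```

Here only the third example is mispredicted (6 instead of 7), so the sum of squared errors is 1 and the cost is 1/6.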

The $\frac{1}{2m}$ factor isn't required, but dividing by $m$ turns the cost into a per-example average, which makes it a reasonable approximation of the "generalization error" for a randomly chosen new sample (one not in the training set).

Including this factor or not doesn't affect the final result: multiplying the cost by a positive constant leaves the minimizing $\theta$ unchanged. It only rescales the gradient, which can be absorbed into the learning rate.
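A small sketch of that invariance, assuming a hypothetical one-parameter model `h(x) = theta * x` and a brute-force search over candidate values of `theta` (not gradient descent). The data and function names are made up for illustration.

```python
def sum_squared_errors(theta, xs, ys):
    """Unscaled cost: sum_i (y_i - theta * x_i)^2."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


def scaled_cost(theta, xs, ys):
    """Same cost multiplied by the constant 1/(2m)."""
    return sum_squared_errors(theta, xs, ys) / (2 * len(xs))


# Hypothetical toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Candidate parameter values 0.000, 0.001, ..., 4.000.
candidates = [i / 1000 for i in range(0, 4001)]

# Pick the candidate that minimizes each version of the cost.
best_unscaled = min(candidates, key=lambda t: sum_squared_errors(t, xs, ys))
best_scaled = min(candidates, key=lambda t: scaled_cost(t, xs, ys))

print(best_unscaled, best_scaled)
```

Both searches select the same `theta`; only the cost values themselves differ by the constant factor.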

Vectorized

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$

with
$X$ = a matrix of the training examples arranged as rows of $X$
$y$ = a vector of all the expected output values
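The vectorized form maps almost directly onto NumPy. The sketch below assumes a linear hypothesis $X\theta$ (as the formula implies) and, as an extra assumption not stated in the note, a leading column of ones in `X` so that the first parameter acts as an intercept. The function name and data are hypothetical.

```python
import numpy as np


def mse_cost_vectorized(X: np.ndarray, y: np.ndarray, theta: np.ndarray) -> float:
    """Return J = 1/(2m) * (X @ theta - y)^T (X @ theta - y)."""
    m = X.shape[0]
    residuals = X @ theta - y  # shape (m,): one residual per training example
    # The dot product of the residual vector with itself is the sum of squares.
    return float(residuals @ residuals) / (2 * m)


# Hypothetical toy data: each row is one training example,
# with a first column of ones for the intercept term (an assumption).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 7.0])
theta = np.array([0.0, 2.0])

cost = mse_cost_vectorized(X, y, theta)
print(cost)
```

For 1-D arrays, `residuals @ residuals` computes the same quantity as $(X\theta - y)^T (X\theta - y)$ without needing an explicit transpose.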