Gradient Descent

#optimization_algorithm

Gradient descent is an iterative method used to automatically find the optimal θ parameters of hθ by minimizing a cost function J(θ).

The idea is to take repeated steps in the direction opposite to the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent when we try to minimize, until the parameters no longer change significantly.

gradient_descent_01.png

However, gradient descent is not guaranteed to find the global optimum (the best parameters): it can end up in one local optimum or another depending on the starting point.

gradient_descent_02.png

⚠️ It is recommended to use feature normalization to speed up the descent.
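For example, a minimal NumPy sketch of mean normalization (the feature matrix `X` and its values are illustrative assumptions):

```python
import numpy as np

# Hypothetical feature matrix: one row per training example, one column per feature.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])

# Mean normalization: subtract the mean and divide by the standard deviation,
# so every feature ends up on a comparable scale and the descent converges faster.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma
```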

Four-step gradient descent example

gradient_descent_03.png

General Batch Gradient Descent algorithm

Repeat until convergence
{

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \ldots, \theta_n)$$

}

with:
$\alpha$ = the learning rate
$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \ldots, \theta_n)$ = the partial derivative of the cost function with respect to $\theta_j$
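A minimal Python sketch of this loop, assuming a caller-supplied `grad` function that returns the gradient of $J$ at the current $\theta$; the function name, tolerance and iteration cap are illustrative assumptions, not part of the original algorithm:

```python
import numpy as np

def batch_gradient_descent(grad, theta0, alpha=0.01, tol=1e-7, max_iters=10_000):
    """Repeat theta := theta - alpha * grad(theta) until the update is negligible."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad(theta)          # alpha = learning rate, grad = dJ/dtheta
        theta = theta - step                # move opposite to the gradient
        if np.linalg.norm(step) < tol:      # "until convergence": the step became tiny
            break
    return theta
```

For instance, `batch_gradient_descent(lambda t: 2 * t, theta0=[5.0])` drives $\theta$ toward $0$, the minimum of $J(\theta) = \theta^2$.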

Normally, the partial derivative of the Mean Squared Error (MSE) cost function with respect to $\theta_j$ is as follows:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)} = \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

But if the cost function expression is already divided by 2 (see Mean Squared Error (MSE)), the factor of 2 coming from the derivative of the square cancels out, and the formula becomes:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
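A short derivation of this cancellation, assuming the halved MSE cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

since $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_n x_n$ gives $\frac{\partial h_\theta(x)}{\partial \theta_j} = x_j$, with the convention $x_0^{(i)} = 1$ (which is why no feature factor appears in the $\theta_0$ update below).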

Batch Gradient Descent for univariate linear regression

Repeat until convergence
{

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

}

with:
$\alpha$ = the learning rate
$m$ = the number of training examples
$x^{(i)}$ = the input (feature) of the $i^{th}$ training example
$h_\theta(x^{(i)})$ = the computed output of the $i^{th}$ training example
$y^{(i)}$ = the expected output of the $i^{th}$ training example
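A minimal NumPy sketch of these two updates on a small hypothetical dataset (the data, learning rate and iteration count are illustrative assumptions). Note that $\theta_0$ and $\theta_1$ are updated simultaneously, from the same current errors:

```python
import numpy as np

# Hypothetical training set: y is roughly 1 + 2*x, so theta should approach (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.1

for _ in range(2000):
    predictions = theta0 + theta1 * x            # h_theta(x^(i)) for every example
    errors = predictions - y                     # h_theta(x^(i)) - y^(i)
    # Simultaneous update: both gradients use the same current errors.
    grad0 = errors.sum() / m
    grad1 = (errors * x).sum() / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1
```

After the loop, `(theta0, theta1)` should be close to the least-squares fit of the data (roughly 1 and 2 here).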

Batch Gradient Descent for multivariate linear regression

Repeat until convergence
{

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}$$
$$\vdots$$
$$\theta_n := \theta_n - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_n^{(i)}$$

}

with:
$\alpha$ = the learning rate
$n$ = the number of features
$m$ = the number of training examples
$x^{(i)}$ = the input (features) of the $i^{th}$ training example
$h_\theta(x^{(i)})$ = the computed output of the $i^{th}$ training example
$y^{(i)}$ = the expected output of the $i^{th}$ training example
$x_j^{(i)}$ = the value of feature $j$ in the $i^{th}$ training example
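The same updates with an explicit loop over the parameters, for a hypothetical design matrix whose first column is the constant $x_0 = 1$ (the data and hyperparameters are illustrative assumptions); the vectorized form in the next section collapses this inner loop into a single matrix expression:

```python
import numpy as np

# Hypothetical data: 4 examples, 2 features plus the constant column x_0 = 1.
# y was generated with theta = (1, 2, 3), so the loop should recover those values.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
y = np.array([4.0, 3.0, 6.0, 1.0])
m, n_params = X.shape

theta = np.zeros(n_params)
alpha = 0.3

for _ in range(2000):
    errors = X @ theta - y                        # h_theta(x^(i)) - y^(i) for every i
    new_theta = theta.copy()
    for j in range(n_params):                     # one update per parameter theta_j
        new_theta[j] = theta[j] - alpha * (errors * X[:, j]).sum() / m
    theta = new_theta                             # all theta_j updated simultaneously
```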

Vectorized Batch Gradient Descent

$$\theta := \theta - \alpha \frac{1}{m} X^{T} (X\theta - y)$$

with:
$\alpha$ = the learning rate
$m$ = the number of training examples
$X$ = the matrix of the training examples, with one training example per row (and the constant $x_0 = 1$ as first column)
$y$ = the vector of all the expected output values
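The same hypothetical data as the previous sketch, with the per-parameter loop replaced by the single matrix expression above:

```python
import numpy as np

# Same hypothetical data: first column of X is the constant x_0 = 1,
# and y was generated with theta = (1, 2, 3).
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
y = np.array([4.0, 3.0, 6.0, 1.0])
m = X.shape[0]

theta = np.zeros(X.shape[1])
alpha = 0.3

for _ in range(2000):
    theta = theta - alpha / m * X.T @ (X @ theta - y)   # vectorized batch update
```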

The “default” gradient descent algorithm is the batch gradient descent algorithm, which uses all the training examples available at each step.


⚠️ TODO: Add other types of gradient descent here.


Debugging Gradient Descent

The simplest way to debug gradient descent is to plot the cost function J(θ) against the number of iterations: if gradient descent is working properly, J(θ) should decrease after every iteration.

gradient_descent_04.png
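A minimal matplotlib sketch of such a plot, assuming a `cost_history` list was filled with the value of $J(\theta)$ recorded at every iteration (the list name and its dummy values are assumptions for illustration):

```python
import matplotlib.pyplot as plt

# cost_history is assumed to hold J(theta) recorded at each iteration of the descent.
cost_history = [10.0, 6.2, 4.1, 3.0, 2.4, 2.1, 1.95, 1.88, 1.85, 1.84]  # dummy values

plt.plot(range(len(cost_history)), cost_history)
plt.xlabel("Number of iterations")
plt.ylabel("Cost J(theta)")
plt.title("Gradient descent convergence")
plt.show()
```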

The learning rate α must be neither too large (the descent may overshoot the minimum and diverge) nor too small (the descent becomes very slow).

So, one good way to choose α is to start with a small value and to progressively increase it by a factor of roughly 3 (… 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1 …) as long as J(θ) does not increase.
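A minimal sketch of this search on the same hypothetical 1-D dataset as above: each candidate α is run for a fixed number of iterations and the final cost is printed, so the largest α that still decreases $J(\theta)$ can be picked (the dataset, helper name and iteration count are illustrative assumptions):

```python
import numpy as np

# Small hypothetical 1-D dataset, reused to compare learning rates.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m = len(x)

def final_cost(alpha, iterations=100):
    """Run batch gradient descent with this learning rate and return the final halved-MSE cost."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = theta0 + theta1 * x - y
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    errors = theta0 + theta1 * x - y
    return (errors ** 2).sum() / (2 * m)

# Roughly tripling progression: keep increasing alpha as long as the final cost keeps improving.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    print(alpha, final_cost(alpha))
```

On this data, the printed cost keeps dropping up to α = 0.1 and then blows up for the larger values, which is exactly the signal used to stop the progression.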