Neural Networks
Depth - length of the longest path from source to sink
Layer - set of all neurons at the same depth with respect to the source
Gradient
For a scalar function $f(x)$ with a multivariate input $x$:
\[df(x) = \nabla_x f(x)\, dx\] \[\nabla_x f(x) = \left[ \dfrac{\partial f}{\partial x_{1}}\;\; \dfrac{\partial f}{\partial x_{2}}\;\; \ldots\;\; \dfrac{\partial f}{\partial x_{n}}\right]\]
Because $df$ is a dot product, for an increment $dx$ of any given length, $df$ is maximal when $dx$ is aligned with the gradient direction $\nabla_x f(x)^T$. So the gradient is the direction of steepest ascent.
If you want the function to decrease, your increment should be in the direction of $-\nabla_x f(x)^T$.
To find a maximum, move in the direction of the gradient:
\[x^{k+1} = x^k + \eta^k \nabla_x f(x^k)^T\]
To find a minimum, move against the direction of the gradient:
\[x^{k+1} = x^k - \eta^k \nabla_x f(x^k)^T\]
There are many schemes for choosing the step size $\eta^k$.
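As a concrete illustration, here is a minimal gradient-descent loop in NumPy; the objective $f(x) = \|x\|^2$, the starting point, the fixed step size, and the iteration count are all assumed for illustration.

```python
import numpy as np

# Example objective f(x) = ||x||^2 and its gradient (illustrative choice)
def f(x):
    return np.dot(x, x)

def grad_f(x):
    return 2.0 * x

x = np.array([3.0, -2.0])    # arbitrary starting point
eta = 0.1                    # fixed step size; many other schemes exist

for k in range(50):
    x = x - eta * grad_f(x)  # move against the gradient to minimize f

print(x, f(x))               # x approaches the minimizer [0, 0]
```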
Hessian
\[\nabla^2_{x} f(x_1, \ldots, x_n) = \begin{bmatrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\\vdots & \vdots & \ddots & \vdots \\\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}\end{bmatrix}\]
Network
A neuron applies a continuous activation function to an affine function of its inputs:
\[y = f\left( \sum_i w_i x_i + b \right)\] \[y = f(x_1, x_2, \ldots, x_N; W)\]
Activation Functions and Derivatives
Sigmoid:
\[f(z) = \frac{1}{1 + \exp(-z)}\] \[f'(z) = f(z)(1 - f(z))\]
Tanh:
\[f(z) = \tanh(z)\] \[f'(z) = 1 - f^2(z)\]
ReLU:
\[f(z) = \begin{cases}z, & z \geq 0 \\0, & z < 0\end{cases}\] \[f'(z) = \begin{cases}1, & z \geq 0 \\0, & z < 0\end{cases}\]
Softplus:
\[f(z) = \log(1 + \exp(z))\] \[f'(z) = \frac{1}{1 + \exp(-z)}\]
Softmax:
\[f(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\] \[\frac{\partial f(z_i)}{\partial z_j} = \begin{cases} f(z_i)(1 - f(z_i)), & i = j \\ - f(z_i) f(z_j), & i \ne j \end{cases}\]
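A quick NumPy sketch of these activations and their derivatives (a minimal illustration; the vectorized forms and the stability shift in softmax are standard practice, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(z, 0.0)

def d_relu(z):
    return (z >= 0).astype(float)

def softplus(z):
    return np.log1p(np.exp(z))          # its derivative is sigmoid(z)

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)  # entry (i, j) is y_i (delta_ij - y_j)

print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0
```
KL Divergence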
Measures how different two probability distributions are. In classification, $Y$ is the predicted distribution and $d$ is the ground truth, usually a one-hot encoding.
\[KL(Y, d) = \sum_i d_i \log \frac{d_i}{y_i}\]
For a one-hot $d$ with correct class $c$, this simplifies to
\[KL(Y, d) = -\log y_c\]
This means it is minimized when the correct class is predicted with the highest probability, $y_c \rightarrow 1$.
Gradient:
\[\frac{d}{dy_c} KL(Y, d) = -\frac{1}{y_c}\]
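A minimal NumPy check of the one-hot case (the distributions here are made up for illustration):

```python
import numpy as np

def kl_divergence(y, d, eps=1e-12):
    """KL(Y, d) = sum_i d_i log(d_i / y_i); terms with d_i = 0 contribute 0."""
    mask = d > 0
    return np.sum(d[mask] * np.log(d[mask] / (y[mask] + eps)))

y = np.array([0.1, 0.7, 0.2])   # predicted distribution (illustrative)
d = np.array([0.0, 1.0, 0.0])   # one-hot ground truth, correct class c = 1

print(kl_divergence(y, d))      # equals -log(y_c) = -log(0.7)
print(-1.0 / y[1])              # gradient d/dy_c KL = -1/y_c
```
Training by Back Propagation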
Forward Pass
Set $y_i^{(0)} = x_i$ and fold the bias into the weights as $w_{0j}^{(k)} = b_j^{(k)}$, with a constant bias input $y_0^{(k)} = 1$ at every layer.
For layer 1,
\[z_j^{(1)} = \sum_i w_{ij}^{(1)}y_i^{(0)} \\ y_j^{(1)} = f_1(z_j^{(1)})\]For layer 2,
\[z_j^{(2)} = \sum_i w_{ij}^{(2)}y_i^{(1)} \\ y_j^{(2)} = f_2(z_j^{(2)})\]Similarly,
\[z_j^{(N)} = \sum_i w_{ij}^{(N)}y_i^{(N-1)} \\ y_j^{(N)} = f_N(z_j^{(N)})\]
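A compact NumPy forward pass over all layers; the layer sizes, weight initialization, and the tanh/softmax activation choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed architecture: 4 inputs -> 5 hidden (tanh) -> 3 outputs (softmax)
sizes = [4, 5, 3]
Ws = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    ys, zs = [x], []
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = ys[-1] @ W + b            # z_j^(k) = sum_i w_ij y_i^(k-1) + b_j
        zs.append(z)
        if k < len(Ws) - 1:
            ys.append(np.tanh(z))     # hidden activation f_k
        else:
            e = np.exp(z - z.max())   # output activation f_N: softmax
            ys.append(e / e.sum())
    return ys, zs

ys, zs = forward(rng.normal(size=4))
print(ys[-1])                         # network output Y, sums to 1
```
Backward Pass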
Step 1: Initialize Gradient at Output Layer
We start by computing the gradient of the loss w.r.t. the network output:
\[\frac{\partial Div}{\partial y^{(N)}_i} = \frac{\partial Div(Y, d)}{\partial y_i}\]Then, propagate this to the pre-activation output $z^{(N)}$:
\[\frac{\partial Div}{\partial z^{(N)}_i} = \frac{\partial y_i^{(N)}}{\partial z_i^{(N)}} \cdot \frac{\partial Div}{\partial y^{(N)}_i} =f_N'\left(z^{(N)}_i\right) \cdot \frac{\partial Div}{\partial y^{(N)}_i}\]
In the case of a vector activation function, such as softmax, each $y_j^{(N)}$ is influenced by every $z_i^{(N)}$:
\[\frac{\partial Div}{\partial z_i^{(N)}} = \sum_j \frac{\partial y_j^{(N)}}{\partial z_i^{(N)}} \cdot \frac{\partial Div}{\partial y_j^{(N)}}\]
Step 2: Backpropagation Through Layers
Loop from layers $k = (N-1) \rightarrow 0$:
For each layer $k$ and for each neuron $i$ in that layer:
Compute gradient of loss w.r.t. activation:
\[\frac{\partial Div}{\partial y^{(k)}_i} = \sum_j w_{ij}^{(k+1)} \cdot \frac{\partial Div}{\partial z^{(k+1)}_j}\]
Chain through the activation function:
\[\frac{\partial Div}{\partial z^{(k)}_i} = \frac{\partial y_i^{(k)}}{\partial z_i^{(k)}} \cdot \frac{\partial Div}{\partial y^{(k)}_i} = f_k'\left(z^{(k)}_i\right) \cdot \frac{\partial Div}{\partial y^{(k)}_i}\]
Step 3: Gradient w.r.t. Weights
For each weight connecting neuron $i$ in layer $k-1$ to neuron $j$ in layer $k$:
\[\frac{\partial Div}{\partial w_{ij}^{(k)}} = y^{(k-1)}_i \cdot \frac{\partial Div}{\partial z^{(k)}_j}\]
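Continuing the forward-pass sketch above, a minimal backward pass for one training instance, assuming the same tanh/softmax network and the KL divergence from earlier (so Step 1 chains through the softmax Jacobian):

```python
def backward(ys, zs, d):
    """Backward pass for one instance; pairs with forward() above."""
    grads_W = [None] * len(Ws)
    grads_b = [None] * len(Ws)

    # Step 1: for KL(Y, d), dDiv/dy_i = -d_i / y_i, chained through the
    # softmax Jacobian (vector-activation case; this Jacobian is symmetric).
    y = ys[-1]
    dy = -d / y
    J = np.diag(y) - np.outer(y, y)
    dz = J @ dy                           # dDiv/dz_i = sum_j dy_j/dz_i * dDiv/dy_j

    for k in reversed(range(len(Ws))):
        # Step 3: dDiv/dw_ij = y_i^(k-1) * dDiv/dz_j^(k)
        grads_W[k] = np.outer(ys[k], dz)
        grads_b[k] = dz
        if k > 0:
            # Step 2: dDiv/dy_i^(k-1) = sum_j w_ij * dDiv/dz_j^(k)
            dy = Ws[k] @ dz
            dz = (1.0 - np.tanh(zs[k - 1]) ** 2) * dy  # chain through tanh
    return grads_W, grads_b

d = np.array([0.0, 1.0, 0.0])             # one-hot target (illustrative)
gW, gb = backward(ys, zs, d)
```
Step 4: Updating Weights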
The actual loss is the average of the divergence over all training instances:
\[Loss = \frac{1}{|\{X\}|}\sum_X Div(Y(X), d(X))\]Actual gradient is the average of the derivatives computed for each training instance:
\[\nabla_W Loss = \frac{1}{|\{X\}|}\sum_X \nabla_W Div(Y(X), d(X))\] \[W \leftarrow W - \eta \nabla_W Loss^T\]
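Continuing the sketch, the per-instance gradients are averaged over a (made-up) batch before one update step:

```python
eta = 0.1
X_batch = rng.normal(size=(8, 4))                  # assumed toy batch
D_batch = np.eye(3)[rng.integers(0, 3, size=8)]    # random one-hot targets

grad_W_avg = [np.zeros_like(W) for W in Ws]
for x, d in zip(X_batch, D_batch):
    ys, zs = forward(x)
    gW, _ = backward(ys, zs, d)
    for k in range(len(Ws)):
        grad_W_avg[k] += gW[k] / len(X_batch)      # average over instances

for k in range(len(Ws)):
    Ws[k] -= eta * grad_W_avg[k]                   # W <- W - eta * grad Loss
```
Summary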
Vector Formulation
Setup:
- Let $\mathbf{y}_N = Y$, the network output
- Let $\mathbf{y}_0 = X$, the input
- Initialize: $\nabla_{\mathbf{y}_N} Div = \nabla_Y Div(Y, d)$
For each layer $k = N \rightarrow 1$:
Compute the Jacobian of activation:
\[J_{\mathbf{y}_k}(\mathbf{z}_k) = \frac{\partial \mathbf{y}_k}{\partial \mathbf{z}_k}\]This is a matrix of partial derivatives:
\[J_{\mathbf{y}}(\mathbf{z}) =\begin{bmatrix}\frac{\partial y_1}{\partial z_1} & \cdots & \frac{\partial y_1}{\partial z_D} \\\vdots & \ddots & \vdots \\\frac{\partial y_M}{\partial z_1} & \cdots & \frac{\partial y_M}{\partial z_D}\end{bmatrix}\]
Backward recursion step:
\[\nabla_{\mathbf{z}_k} Div = \nabla_{\mathbf{y}_k} Div \cdot J_{\mathbf{y}_k}(\mathbf{z}_k)\] \[\nabla_{\mathbf{y}_{k-1}} Div = \nabla_{\mathbf{z}_k} Div \cdot \mathbf{W}_k\]
Gradient Computation:
\[\nabla_{\mathbf{W}_k} Div = \mathbf{y}_{k-1} \cdot \nabla_{\mathbf{z}_k} Div\] \[\nabla_{\mathbf{b}_k} Div = \nabla_{\mathbf{z}_k} Div\]
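One backward-recursion step in matrix form, under the row-vector gradient convention used above; the sizes, tanh activation, and random values are assumptions for illustration. Note that under this convention $\nabla_{\mathbf{W}_k} Div$ comes out with the shape of $\mathbf{W}_k^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_out = 4, 3
W_k = rng.normal(size=(D_out, D_in))     # z_k = W_k y_{k-1} + b_k
y_prev = rng.normal(size=D_in)
z_k = W_k @ y_prev
y_k = np.tanh(z_k)

grad_y_k = rng.normal(size=(1, D_out))   # incoming row-vector gradient

J = np.diag(1.0 - y_k ** 2)              # Jacobian dy_k/dz_k, diagonal for tanh
grad_z_k = grad_y_k @ J                  # grad_z = grad_y . J
grad_y_prev = grad_z_k @ W_k             # grad_{y_{k-1}} = grad_z . W_k
grad_W_k = np.outer(y_prev, grad_z_k)    # y_{k-1} grad_z, shape (D_in, D_out)
grad_b_k = grad_z_k                      # grad_b = grad_z
```
Summary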
Important Points
- Backpropagation will often not find a separating solution even though the solution is within the class of functions learnable by the network, because the separable solution is not a feasible optimum for the loss function.
- It is minimally changed by new training instances — doesn’t swing wildly in response to small changes to the input
- It prefers consistency over perfection, due to which it works better even if there are outliers
- It is a low-variance estimator, at the potential cost of bias
- Minimizing the differentiable loss function does not imply minimizing the classification error.
Convergence of Gradient Descent
Convergence Rate
It measures how quickly an iterative optimization algorithm approaches the solution.
\[R = \left| \frac{f(x^{(k+1)}) - f(x^*)}{f(x^{(k)}) - f(x^*)} \right|\]Where:
- $x^{(k)}$: point at iteration $k$
- $x^*$: optimal point (solution)
- $f(x)$: objective function
If $R<1$, that means that the function value is getting closer to the optimum and is hence converging. The smaller the $R$, the faster the convergence.
If $R$ is bounded above by a constant less than $1$, the algorithm has linear convergence: the gap between the function value and the optimum shrinks exponentially with the iteration count.
\[|f(x^{(k)}) - f(x^*)| \leq R^k \cdot |f(x^{(0)}) - f(x^*)|\]
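The ratio can be measured empirically; this sketch runs gradient descent on an assumed quadratic $f(x) = 2x^2$ (optimum $x^* = 0$, $f(x^*) = 0$) and prints $R$ at each step:

```python
f = lambda x: 2.0 * x ** 2       # assumed quadratic, f'' = 4
df = lambda x: 4.0 * x

x, eta = 1.0, 0.1
prev_gap = f(x)                  # f(x*) = 0, so the gap is just f(x)
for k in range(5):
    x -= eta * df(x)
    gap = f(x)
    print(f"k={k}  R={gap / prev_gap:.4f}")  # constant R < 1: linear convergence
    prev_gap = gap
```
Convergence for Quadratic Surfaces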
What is the optimal step size $\eta$ to reach the minimum value of the error fastest?
- Consider a general quadratic function:
\[E(w) = \frac{1}{2} a w^2 + b w + c\]
- Gradient descent update:
\[w^{(k+1)} = w^{(k)} - \eta E'(w^{(k)})\]
- Taylor expansion of $E(w)$ around $w^{(k)}$:
\[E(w) = E(w^{(k)}) + E'(w^{(k)})(w - w^{(k)}) + \frac{1}{2} E''(w^{(k)})(w - w^{(k)})^2\]
- Newton’s method gives:
\[w^{(k+1)} = w^{(k)} - \frac{E'(w^{(k)})}{E''(w^{(k)})}\]
- Optimal learning rate:
\[\eta_{opt} = \frac{1}{E''(w^{(k)})}\]
Effect of step size $\eta$ (demonstrated in the sketch after this list):
- $\eta < \eta_{opt} \rightarrow$ monotonic convergence
- $\eta = \eta_{opt} \rightarrow$ fast convergence in one step
- $\eta_{opt}<\eta < 2\eta_{opt} \rightarrow$ oscillating convergence
- $\eta \geq 2\eta_{opt} \rightarrow$ divergence
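The four regimes can be seen directly on an assumed quadratic $E(w) = \frac{1}{2} a w^2$, where $\eta_{opt} = 1 / E''(w) = 1/a$:

```python
a = 2.0
eta_opt = 1.0 / a

def run(eta, w=1.0, steps=5):
    traj = [w]
    for _ in range(steps):
        w = w - eta * a * w               # gradient step: E'(w) = a w
        traj.append(round(w, 4))
    return traj

print("monotonic:  ", run(0.5 * eta_opt))  # eta < eta_opt
print("one step:   ", run(eta_opt))        # eta = eta_opt hits w* = 0 at once
print("oscillating:", run(1.5 * eta_opt))  # eta_opt < eta < 2 eta_opt
print("divergent:  ", run(2.5 * eta_opt))  # eta >= 2 eta_opt
```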
Convergence for Multivariate Quadratic Functions
- General form of a quadratic convex function:
\[E(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T A \mathbf{w} + \mathbf{w}^T \mathbf{b} + c\]
- If $A$ is diagonal:
\[E(\mathbf{w}) = \sum_i \left( \frac{1}{2} a_{ii} w_i^2 + b_i w_i \right) + c\]
This means the cost function is a sum of independent univariate quadratics. Each direction $w_i$ is uncoupled.
Equal-Value Contours:
- For diagonal $A$, the contours are ellipses aligned with coordinate axes.
- Each coordinate’s behavior is independent of the others.
Optimal Step Size:
For each dimension $i$:
\[\eta_{i,\text{opt}} = \frac{1}{a_{ii}} = \left( \frac{\partial^2 E}{\partial w_i^2} \right)^{-1}\]Each coordinate has a different optimal learning rate.
Problem with Vector Update Rule:
The conventional update applies the same step size $\eta$ to all coordinates, so one coordinate may converge optimally while another diverges because the step is too large for it.
\[\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \eta \nabla_{\mathbf{w}} E\]
Safe Learning Rate Rule to Avoid Divergence:
\[\eta < 2 \cdot \min_i \eta_{i,\text{opt}}\]This guarantees convergence but slows down overall learning.
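A small demonstration with an assumed diagonal quadratic $E(\mathbf{w}) = \frac{1}{2}(w_1^2 + 10 w_2^2)$, where the per-coordinate optimal rates are $1$ and $0.1$:

```python
import numpy as np

a = np.array([1.0, 10.0])        # curvatures; eta_i_opt = 1 / a_i
w = np.array([1.0, 1.0])

eta = 0.5                        # fine for w1, but 0.5 >= 2 * 0.1: w2 diverges
for _ in range(10):
    w = w - eta * a * w
print("shared eta = 0.5:", w)    # w2 blows up

w = np.array([1.0, 1.0])
eta = 0.19                       # < 2 * min_i eta_i_opt = 0.2: safe but slow
for _ in range(10):
    w = w - eta * a * w
print("safe eta = 0.19:", w)     # converges, slowly along w1
```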
Generic Convex Functions
- We can apply a Taylor expansion to approximate any smooth convex function:
\[E(\mathbf{w}) \approx E(\mathbf{w}^{(k)}) + \nabla E(\mathbf{w}^{(k)})(\mathbf{w} - \mathbf{w}^{(k)}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{(k)})^T H_E (\mathbf{w} - \mathbf{w}^{(k)})\]
- $H_E$ is the Hessian of second derivatives and measures the curvature of loss function
- Optimal step size is inversely proportional to eigenvalues of the Hessian
- The eigenvalues give curvature in orthogonal directions
- For the smoothest convergence, the eigenvalues must all be equal
Convergence Challenges
- In high dimensions, convergence becomes harder to control
- Ideally, the step size $\eta$ should work for both the direction of highest curvature (largest eigenvalue) and the direction of lowest curvature (smallest eigenvalue)
- If the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ of the Hessian is large, then convergence is slow and unstable (see the sketch below).
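For instance, the condition number of an assumed Hessian can be computed directly:

```python
import numpy as np

H = np.array([[10.0, 0.0],
              [ 0.0, 0.5]])            # assumed Hessian
eigvals = np.linalg.eigvalsh(H)        # curvatures along orthogonal directions
kappa = eigvals.max() / eigvals.min()
print(kappa)                           # 20.0: large kappa => slow, unstable descent
```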
Decaying Learning Rate
The loss surface has many saddle points, and gradient descent can stagnate on them.
- Start with a large learning rate to explore fast
- Gradually reduce step size to fine-tune near the minimum
- Prevents overshooting and bouncing around the optimum
Common Decay Schedules:
\[\text{Linear:} \quad \eta_k = \frac{\eta_0}{k + 1}\] \[\text{Quadratic:} \quad \eta_k = \frac{\eta_0}{(k + 1)^2}\] \[\text{Exponential:} \quad \eta_k = \eta_0 \cdot e^{-\beta k}, \quad \beta > 0\]
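The three schedules side by side (the $\eta_0$ and $\beta$ values are arbitrary illustrations):

```python
import numpy as np

eta0, beta = 0.1, 0.05

linear      = lambda k: eta0 / (k + 1)
quadratic   = lambda k: eta0 / (k + 1) ** 2
exponential = lambda k: eta0 * np.exp(-beta * k)

for k in [0, 1, 10, 100]:
    print(k, linear(k), quadratic(k), exponential(k))
```
Common Approach for Neural Networks: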
- Train with a fixed learning rate until the validation loss stagnates
- Reduce the learning rate, e.g. $\eta \leftarrow \alpha \eta$ with a factor $\alpha < 1$
- Resume training from the same weights, repeat as needed
RProp
- Resilient Propagation is a first-order optimization algorithm that adjusts step size independently for each parameter
- It relies only on the sign of the gradient, not its magnitude
- It is more robust than vanilla gradient descent
- Does not need a global learning rate
- Doesn’t require Hessian or curvature information and doesn’t assume convexity
The core idea is that at each step:
- If the gradient sign has not changed → increase step size in the same direction
- If the gradient sign has flipped → reduce step size and reverse direction (overshoot detected)
Algorithm:
- For each layer $l$, for each parameter $w_{l,i,j}$: initialize the step size $\Delta w_{l,i,j}$ and the gradient $\text{prevD}(l, i, j) = D(l, i, j) = \frac{\partial Div}{\partial w_{l,i,j}}$
- While not converged:
- Update parameter:
\[w_{l,i,j} = w_{l,i,j} - \text{sign}(D(l, i, j)) \cdot \Delta w_{l,i,j}\]
- Recompute gradient:
\[D(l, i, j) = \frac{\partial Div}{\partial w_{l,i,j}}\]
Check sign consistency:
If:
\[\text{sign}(\text{prevD}(l, i, j)) == \text{sign}(D(l, i, j))\]Then,
\[\Delta w_{l, i, j} = \min(\alpha \cdot \Delta w_{l, i, j}, \Delta_{\max}) \\ \text{prevD}(l, i, j) = D(l, i, j)\]Else undo step:
\[w_{l, i, j} = w_{l, i, j} + \text{sign}(\text{prevD}(l, i, j)) \cdot \Delta w_{l, i, j} \\ \Delta w_{l, i, j} = \max(\beta \cdot \Delta w_{l, i, j}, \Delta_{\min})\]
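A minimal NumPy sketch of the whole scheme; the hyperparameter values are common defaults assumed for illustration, and this follows the sign-based update described above rather than any particular library's implementation.

```python
import numpy as np

def rprop(grad_fn, w, delta0=0.1, alpha=1.2, beta=0.5,
          delta_max=50.0, delta_min=1e-6, steps=100):
    """RProp for a parameter vector w; grad_fn(w) returns dE/dw."""
    delta = np.full_like(w, delta0)       # per-parameter step sizes
    prev_d = grad_fn(w)
    w = w - np.sign(prev_d) * delta       # initial update
    for _ in range(steps):
        d = grad_fn(w)
        same = np.sign(d) == np.sign(prev_d)
        # sign unchanged: grow the step, capped at delta_max
        delta[same] = np.minimum(alpha * delta[same], delta_max)
        # sign flipped: undo that step (overshoot detected), then shrink,
        # floored at delta_min
        w[~same] += np.sign(prev_d[~same]) * delta[~same]
        delta[~same] = np.maximum(beta * delta[~same], delta_min)
        prev_d = d
        w = w - np.sign(d) * delta        # next update
    return w

# Toy usage on an assumed quadratic bowl E(w) = 0.5 * sum(a_i * w_i^2)
a = np.array([1.0, 25.0])
print(rprop(lambda w: a * w, np.array([3.0, -2.0])))  # approaches [0, 0]
```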