
Given samples $x^{(i)}$ and labels $y^{(i)}$, find a function $f$ such that $f(x^{(i)}) \approx y^{(i)}$.




$$f(x) = w_0 + \sum_{i=1}^n w_i x_i,$$
where the parameters are determined by solving
$$\min_{w} \sum_{i=1}^m \left(f(x_i) - y_i\right)^2$$
For classification $y_i \in \{-1,1\}$ we take
$$f(x) = \sigma\bigl(Wx + b\bigr),$$
with $\sigma(t) = (1 + \exp(-t))^{-1}$, and determine $W$ and $b$ by solving
$$\min_{w} \sum_{i=1}^m \left(1 - y_if(x_i)\right)^2.$$
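As a minimal sketch of this classifier and its loss (NumPy; the weights and data used below are illustrative assumptions, not values from the text):

```python
import numpy as np

def sigma(t):
    # Logistic function: maps R to (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

def f(x, W, b):
    # Classifier f(x) = sigma(W x + b).
    return sigma(W @ x + b)

def loss(W, b, xs, ys):
    # Sum over samples of (1 - y_i f(x_i))^2, with labels y_i in {-1, 1}.
    return sum((1.0 - y * f(x, W, b)) ** 2 for x, y in zip(xs, ys))
```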
$$f(x) = f_n \circ f_{n-1} \circ \ldots \circ f_1(x),$$ with $$f_k(x) = \sigma \left(W_k x + b_k\right).$$
$$f(x) = \sum_{i=1}^m w_i k(x_i,x).$$
Find a minimizer of a cost function $$C(w) = \sum_{i=1}^m L(f_i(w),y_i) + R(w),$$
where $L$ is a loss function and $R$ a regularization term.
For linear regression, let $f_i(w) = x_i^Tw$, $L(f_i(w),y_i) = \left(f_i(w) - y_i\right)^2$, and $R(w) = \beta\|w\|_2^2$.
This leads to a least-squares problem
$$C(w) = \|Xw - y\|_2^2 + \beta\|w\|_2^2,$$
with a closed-form solution
$$\widehat{w} = \left(X^T\!X + \beta I\right)^{-1}X^Ty.$$
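A minimal NumPy sketch of this closed-form solution (the synthetic data and the value of $\beta$ are assumptions); solving the linear system is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Synthetic regression data (illustrative sizes).
rng = np.random.default_rng(0)
m, n, beta = 100, 5, 0.1
X = rng.standard_normal((m, n))
w_true = rng.standard_normal(n)
y = X @ w_true + 0.01 * rng.standard_normal(m)

# Ridge solution: solve (X^T X + beta I) w = X^T y.
w_hat = np.linalg.solve(X.T @ X + beta * np.eye(n), X.T @ y)
```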
The optimization problem
$$\min_w \sum_{i=1}^m L(f_i(w),y_i) + R(w),$$
Three properties come up frequently in optimization:
A function $C : \mathbb{R}^n \rightarrow \mathbb{R}$ is Lipschitz continuous with Lipschitz constant $\rho$ if $\forall \,w,w' \in \mathbb{R}^n$
$$\left|C(w) - C(w')\right| \leq \rho \|w - w'\|_2.$$

A function $C : \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if $\forall w,w' \in \mathbb{R}^n$ and $\forall t \in [0,1]$ we have
$$C(tw + (1-t)w') \leq tC(w) + (1-t)C(w').$$

A function $C : \mathbb{R}^n \rightarrow \mathbb{R}$ is $\mathcal{C}^k$-smooth if all partial derivatives of order $\leq k$ exist and are continuous.
For $k \geq 1$: the gradient $\nabla C$ is well-defined and continuous.
For $k \geq 2$: the Hessian $\nabla^2 C$ is well-defined and continuous.

(Figure: examples of a smooth function and a convex function.)

Basic template (for smooth cost functions)
$$w_{k+1} = w_k + \alpha_k d_k,$$
where $\alpha_k$ is the stepsize and the search direction $d_k$ is constructed from the gradient, e.g. $d_k = -\nabla C(w_k)$ for steepest descent.
Note that the search direction needs to be a descent direction: $\langle \nabla C(w_k), d_k \rangle < 0$.
Steepest descent with a fixed stepsize ($\alpha_k \in (0,2/\ell)$):
The iterates converge to a stationary point from any initial point
For convex functions, the iterates converge to a minimum at a sub-linear rate of $\mathcal{O}(1/k)$
For strongly convex functions, the iterates converge to the minimum at a linear rate $\mathcal{O}(\rho^{k})$
To reach a point $w_k$ for which $|C(w_k) - C(w_*)| \leq \epsilon$ we need $\mathcal{O}(1/\epsilon)$ iterations in the convex case and $\mathcal{O}(\log(1/\epsilon))$ iterations in the strongly convex case.
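A steepest-descent sketch on a strongly convex quadratic (NumPy; the matrix $A$, vector $b$, and iteration count are illustrative assumptions):

```python
import numpy as np

# Strongly convex quadratic: C(w) = 0.5 * ||A w - b||^2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])

def grad(w):
    return A.T @ (A @ w - b)

# ell = Lipschitz constant of the gradient = largest eigenvalue of A^T A.
ell = np.linalg.eigvalsh(A.T @ A).max()
alpha = 1.0 / ell          # fixed stepsize in (0, 2/ell)

w = np.zeros(2)
for _ in range(500):
    w = w - alpha * grad(w)   # w_{k+1} = w_k - alpha * grad C(w_k)

w_star = np.linalg.solve(A, b)  # exact minimizer of this quadratic
```

For this well-conditioned problem the linear rate makes 500 iterations far more than enough to reach machine precision.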

In practice we need to trade off convergence rate, robustness, and computational cost:
We can use sub-gradients:
$$w_{k+1} = w_k - \alpha_k\left(\nabla L(w_k) + g_k \right), \qquad g_k \in \partial R(w_k),$$
and get a sub-linear convergence rate
...or modify the basic iteration to retain a linear convergence rate when $L$ is strongly convex
$$w_{k+1} = \mathrm{prox}_{\alpha_k R}\left(w_k - \alpha_k\nabla L(w_k)\right),$$
where $\mathrm{prox}_{\alpha R}(z) = \mathrm{argmin}_x \, R(x) + \frac{1}{2\alpha}\|z - x\|_2^2$.
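For $R(w) = \lambda\|w\|_1$ the proximal operator is soft-thresholding, and the iteration above becomes ISTA; a minimal sketch (NumPy; the synthetic lasso-type problem and its parameters are assumptions):

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1: shrink each entry towards zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Lasso-type problem: L(w) = 0.5 * ||X w - y||^2, R(w) = lam * ||w||_1.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w_sparse = np.zeros(10)
w_sparse[:3] = [1.0, -2.0, 0.5]
y = X @ w_sparse
lam = 0.1

alpha = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # stepsize 1/ell
w = np.zeros(10)
for _ in range(2000):
    # w_{k+1} = prox_{alpha R}(w_k - alpha * grad L(w_k))
    w = soft_threshold(w - alpha * X.T @ (X @ w - y), alpha * lam)
```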


Note that when $R$ is the indicator of a convex set, the proximal operator projects onto the convex set.




Choosing an appropriate regularization parameter is its own optimization problem!

These examples indicate that the resulting $f$ can be very sensitive to its input: it satisfies
$$|f(x) - f(y)| \leq C\,|x - y|,$$
only with a very large constant $C$.
When $C$ and $R$ are convex, the following three problems are equivalent
$$\min_{w} C(w) + \lambda R(w),$$
$$\min_{w} C(w) \quad \text{s.t.} \quad R(w) \leq \tau,$$
$$\min_{w} R(w) \quad \text{s.t.} \quad C(w) \leq \sigma.$$
The parameters $\tau$ and $\sigma$ are connected through the Pareto curve.
To find the $\tau$ corresponding to a given $\sigma$ we solve a rootfinding problem with the value function
$$\phi(\tau) = \min_{w} C(w) \quad \text{s.t.} \quad R(w) \leq \tau.$$
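A toy rootfinding sketch for $\phi(\tau) = \sigma$ (NumPy; the specific choices $C(w) = \|w - a\|_2^2$ and $R(w) = \|w\|_2$ are assumptions, made so that $\phi$ can be evaluated by projecting onto a ball):

```python
import numpy as np

# Toy instance: C(w) = ||w - a||^2, R(w) = ||w||_2, so that
# phi(tau) = min_{||w|| <= tau} ||w - a||^2 is computed by projection.
a = np.array([3.0, 4.0])          # ||a|| = 5

def phi(tau):
    w = a * min(1.0, tau / np.linalg.norm(a))   # projection onto the ball
    return np.sum((w - a) ** 2)

# Find tau with phi(tau) = sigma_target by bisection (phi is decreasing).
sigma_target = 4.0
lo, hi = 0.0, np.linalg.norm(a)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if phi(mid) > sigma_target else (lo, mid)
tau = 0.5 * (lo + hi)
```

Here $\phi(\tau) = (\max(\|a\| - \tau, 0))^2$, so the root is $\tau = \|a\| - \sqrt{\sigma} = 3$.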

The function has a recursive structure: $$f(x_0) = x_n, \quad x_k = \sigma\left(W_{k-1} x_{k-1}+ b_{k-1}\right).$$
Differentiating the recursion in a direction $\Delta x_0$ gives $$\Delta x_{k} = \sigma'\left(W_{k-1} x_{k-1}+ b_{k-1}\right)\odot \left(W_{k-1}\Delta x_{k-1}\right),$$ where $\sigma'$ is applied elementwise.
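The forward recursion and its directional derivative can be propagated jointly; a sketch (NumPy; the sigmoid activation matches the text, while the layer shapes used in any test are assumptions):

```python
import numpy as np

def sigma(z):
    # Logistic activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_jvp(Ws, bs, x0, dx0):
    # Propagate x_k and the directional derivative Delta x_k together:
    #   x_k      = sigma(W_{k-1} x_{k-1} + b_{k-1})
    #   Delta x_k = sigma'(W_{k-1} x_{k-1} + b_{k-1}) * (W_{k-1} Delta x_{k-1})
    x, dx = x0, dx0
    for W, b in zip(Ws, bs):
        z = W @ x + b
        s = sigma(z)
        dx = s * (1.0 - s) * (W @ dx)   # sigma'(z) = s(1-s), elementwise
        x = s
    return x, dx
```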
We evaluate the gradient at only one sample at a time: $$w_{k+1} = w_k - \alpha_k \nabla C(w_k; x_i),$$
or use a batch $$w_{k+1} = w_k - \frac{\alpha_k}{|\mathcal{I}_k|} \sum_{i\in\mathcal{I}_k}\nabla C(w_k; x_i)$$
We can interpret this algorithm as gradient-descent on the full objective that includes all $m$ training-samples with a noisy gradient: $\nabla C(w) + E_k$.
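A minibatch SGD sketch on a least-squares objective (NumPy; the data sizes, stepsize, batch size, and iteration count are illustrative assumptions):

```python
import numpy as np

# Synthetic least-squares problem with m samples, n features.
rng = np.random.default_rng(2)
m, n = 200, 5
X = rng.standard_normal((m, n))
w_true = rng.standard_normal(n)
y = X @ w_true

w = np.zeros(n)
alpha, batch = 0.01, 20
for k in range(3000):
    idx = rng.integers(0, m, size=batch)          # sample a batch I_k
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # averaged batch gradient
    w = w - alpha * g                             # w_{k+1} = w_k - alpha * g
```

Since this synthetic data is noiseless, the gradient noise vanishes at the solution and the iterates converge to $w_{\mathrm{true}}$ even with a fixed stepsize.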
The full gradient $$\nabla C(w) = \frac{1}{m}\sum_{i=1}^m \nabla C(w,x_i)$$
can be approximated by a sample average $$\nabla C(w) \approx \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \nabla C(w,x_i).$$
Main assumptions: the cost is $\mu$-strongly convex, its gradient is $\ell$-Lipschitz, and the variance of the stochastic gradients is bounded by $\sigma^2$.
Basic iteration produces iterates for which $$\mathbb{E}(C(w_{k+1}) - C(w_*)) \leq \left(1 - 2\mu\alpha_k + \mu\ell\alpha_k^2\right)\left(C(w_{k}) - C(w_*)\right) + \frac{\alpha_k^2 \sigma ^2\ell }{2}.$$
We find that with a fixed stepsize the expected error decreases linearly until it plateaus at a level proportional to $\alpha\sigma^2\ell/\mu$, while a decreasing stepsize $\alpha_k \sim 1/k$ removes the plateau at the cost of a sublinear rate.

Linear convergence without stepsize reduction can be retained by variance reduction:
Further improvements:
Project iterates onto principal components of
$$M = \left(w_0 - w_n, w_{1} - w_{n}, \ldots, w_{n-1} - w_n\right),$$
and plot objective along dominant directions.
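A sketch of this visualization pipeline (NumPy; the toy objective and the synthetic iterate history are assumptions): compute the dominant directions of $M$ via an SVD and evaluate the objective on a 2-D grid around $w_n$ (the plotting itself is omitted):

```python
import numpy as np

# Toy objective and a synthetic iterate history (assumptions).
def C(w):
    return 0.5 * np.sum(w ** 2)

rng = np.random.default_rng(3)
iterates = [rng.standard_normal(10) * 0.9 ** k for k in range(20)]
w_final = iterates[-1]

# Columns of M are w_k - w_n; principal directions from the SVD of M.
M = np.stack([w - w_final for w in iterates[:-1]], axis=1)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
d1, d2 = U[:, 0], U[:, 1]          # two dominant directions

# Evaluate the objective on a 2-D grid around w_n along d1 and d2.
ts = np.linspace(-1.0, 1.0, 21)
landscape = np.array([[C(w_final + a * d1 + b * d2) for b in ts] for a in ts])
```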

