Given $m$ samples $x^{(i)} \in \mathbb{R}^n$, find a simple description / summary of the dataset.

Fit a normal distribution to the samples
$$\max_{\mu,\Sigma} \prod_{i=1}^m(2\pi)^{-n/2}\text{det}(\Sigma)^{-1/2}\exp\left(-\frac{1}{2}(x^{(i)} - \mu)^T\Sigma^{-1}(x^{(i)} - \mu)\right)$$
leads (after taking the negative logarithm and dropping constant terms) to $$\min_{\mu,\Sigma}\,\, m\log(\text{det}(\Sigma)) + \sum_{i=1}^m(x^{(i)} - \mu)^T\Sigma^{-1}(x^{(i)} - \mu)$$
which gives $$\widehat{\mu} = \frac{1}{m}\sum_{i=1}^m x^{(i)}, \quad \widehat{\Sigma} = \frac{1}{m}\sum_{i=1}^m \left(x^{(i)} - \widehat{\mu}\right)\left(x^{(i)} - \widehat{\mu}\right)^T.$$
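A minimal numpy sketch of this closed-form solution; the synthetic data matrix (samples stored as rows) is an illustrative assumption:

```python
# Sketch: maximum-likelihood fit of a Gaussian from samples (rows of X).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # m = 500 synthetic samples in R^3 (assumption)

mu_hat = X.mean(axis=0)              # sample mean = MLE of mu
R = X - mu_hat                       # centered samples
Sigma_hat = (R.T @ R) / X.shape[0]   # MLE of Sigma (divides by m, not m - 1)

print(mu_hat)
print(Sigma_hat)
```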

Here, the goal is to fit a probability distribution of the form $$f(x) = \sum_{i=1}^k w_i f_i(x),$$ where $f_i(x)$ is a normal density with parameters $(\mu_i, \Sigma_i)$ and the weights satisfy $w_i \geq 0$, $\sum_{i=1}^k w_i = 1$.
The parameters can be obtained by minimizing the negative log-likelihood $$\min_{\{w_i, \mu_i, \Sigma_i\}}\, -\sum_{j=1}^m\log\left(\sum_{i=1}^k w_i f_i\left(x^{(j)}\right)\right).$$
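In practice the mixture parameters are usually fitted with the EM algorithm. A minimal sketch using scikit-learn's GaussianMixture; the library choice, the synthetic two-cluster data, and the number of components are illustrative assumptions:

```python
# Sketch: fit a Gaussian mixture model with EM (scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
               rng.normal(+3.0, 1.0, size=(200, 2))])   # two synthetic clusters

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
print(gmm.weights_)       # mixture weights w_i
print(gmm.means_)         # component means mu_i
print(gmm.covariances_)   # component covariances Sigma_i
```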

Given a sample covariance matrix $C$, find a sparse approximation, $\Sigma$, of the inverse covariance (precision) matrix:
$$\min_{\Sigma \succ 0} -\log|\Sigma| + \text{trace}(C\Sigma) + \alpha \|\Sigma\|_1.$$
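A minimal sketch of this estimator (the graphical lasso) using scikit-learn; the library choice, the regularization strength, and the synthetic data are illustrative assumptions:

```python
# Sketch: sparse inverse-covariance (precision) estimation via the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # synthetic samples (assumption)

model = GraphicalLasso(alpha=0.1).fit(X)
print(model.precision_)                # estimated sparse inverse covariance
print(model.covariance_)               # corresponding covariance estimate
```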

Organize the data in a matrix $X = (x^{(1)}, x^{(2)}, \ldots, x^{(m)}) \in \mathbb{R}^{n\times m}$, whose columns are the samples.
Any matrix $X \in \mathbb{R}^{n \times m}$ has a singular value decomposition $$X = U S V^T,$$ where $U \in \mathbb{R}^{n\times n}$ and $V \in \mathbb{R}^{m\times m}$ are unitary matrices and $S \in \mathbb{R}^{n\times m}$ is a diagonal matrix with non-negative entries.
Keeping only the $p$ largest singular values (and the corresponding columns of $U$ and $V$) gives the best rank-$p$ approximation $$X \approx \widetilde{X} = U_p S_p V_p^T.$$
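A minimal numpy sketch of the truncated SVD; the matrix size and the rank $p$ are illustrative assumptions:

```python
# Sketch: best rank-p approximation via the truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))                 # data matrix (assumption)
p = 5                                          # target rank (assumption)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_p = U[:, :p] @ np.diag(s[:p]) @ Vt[:p, :]    # U_p S_p V_p^T

print(np.linalg.norm(X - X_p, "fro"))          # approximation error
```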



Decompose a matrix into non-negative components:
$$\min_{U,V} \|X - UV^T\|_{F}^2, \quad \text{s.t.} \quad u_{ij}, v_{ij} \geq 0.$$
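A minimal sketch using scikit-learn's NMF solver; the library choice, the rank, and the synthetic non-negative data are illustrative assumptions:

```python
# Sketch: non-negative matrix factorization X ~ U V^T with U, V >= 0 (scikit-learn).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random(size=(100, 40))                 # entries in [0, 1), hence non-negative

model = NMF(n_components=5, init="nndsvda", max_iter=500)
U = model.fit_transform(X)                     # U >= 0
V = model.components_.T                        # V >= 0, so X ~ U V^T

print(np.linalg.norm(X - U @ V.T, "fro"))
```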
A robust low-rank approximation replaces the squared Frobenius norm by a robust penalty:
$$\min_{\widetilde{X}} L(X - \widetilde{X}) \quad \text{s.t.} \quad \text{rank}(\widetilde{X}) \leq p,$$
where $L$ is a robust penalty (e.g., $L(R) = \|R\|_1$).
Given partial knowledge of the entries of $X$ (those indexed by a set $\Omega$), fill in the missing ones assuming a low-rank structure of $X$:
$$\min_{W}\, \text{rank}(W) \quad \text{s.t.} \quad w_{ij} = x_{ij} \, \text{for}\, (i,j) \in \Omega.$$
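The rank objective is non-convex; a standard convex surrogate is the nuclear norm introduced further below. A minimal sketch of that relaxation using cvxpy; the library choice, the problem size, and the observed fraction are illustrative assumptions:

```python
# Sketch: matrix completion via the nuclear-norm relaxation of the rank (cvxpy).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_true = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 30))   # rank-3 ground truth
mask = (rng.random(size=X_true.shape) < 0.5).astype(float)     # observed entries Omega

W = cp.Variable(X_true.shape)
constraints = [cp.multiply(mask, W) == mask * X_true]          # w_ij = x_ij on Omega
problem = cp.Problem(cp.Minimize(cp.normNuc(W)), constraints)
problem.solve()                           # needs an SDP-capable solver (e.g., SCS)

print(np.linalg.norm(W.value - X_true) / np.linalg.norm(X_true))
```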


Many problems in unsupervised learning involve matrix decompositions.
The goal is to decompose a matrix as $$X = USV^T,$$ with $U$ and $V$ unitary and $S$ a non-negative diagonal matrix.
Observation: repeated application of a matrix to a vector aligns the vector with the eigenvector corresponding to the largest (in magnitude) eigenvalue; this is the basis of the power method.
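A minimal numpy sketch of the resulting power iteration; the symmetric test matrix and the iteration count are illustrative assumptions:

```python
# Sketch: power iteration for the dominant eigenvector of a symmetric matrix.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
A = A + A.T                            # symmetrize (assumption, for real eigenpairs)

v = rng.normal(size=50)
for _ in range(200):
    v = A @ v                          # repeated application of the matrix
    v /= np.linalg.norm(v)             # renormalize to avoid overflow

lam = v @ A @ v                        # Rayleigh quotient
print(abs(lam), np.abs(np.linalg.eigvalsh(A)).max())   # should agree
```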



Observation: distances between points are approximately preserved under random projection (Johnson-Lindenstrauss lemma).
There is a linear map $P \in \mathbb{R}^{k\times n}$ with $k > 8\log(m)/\epsilon^2$ such that, for all pairs $(i,j)$, $$(1 - \epsilon)\|x^{(i)} - x^{(j)}\|_2 \leq \|Px^{(i)} - Px^{(j)}\|_2 \leq (1 + \epsilon)\|x^{(i)}- x^{(j)}\|_2.$$
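A minimal numpy sketch of a Gaussian random projection; the dimensions, the target dimension $k$, and the $1/\sqrt{k}$ scaling are illustrative assumptions:

```python
# Sketch: random projection approximately preserves pairwise distances.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 1000, 300
X = rng.normal(size=(m, n))                    # m points in R^n (assumption)

P = rng.normal(size=(k, n)) / np.sqrt(k)       # Gaussian projection, scaled by 1/sqrt(k)
Y = X @ P.T                                    # projected points in R^k

d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
print(d_proj / d_orig)                         # close to 1
```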
Important factors when choosing a method are:
Both covariance estimation and matrix decomposition lead to optimization problems of the form
$$\min_{Z} \, L(Z) + R(Z),$$
where $L$ is a data-fit term and $R$ a regularizer; constrained variants can be handled with manifold-optimization toolboxes (e.g., Manopt, McTorch).
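A minimal sketch of a proximal-gradient (ISTA) iteration for one instance of this composite form, with $L$ a least-squares term and $R$ an $\ell_1$ penalty; the data, the regularization strength, and the step size are illustrative assumptions:

```python
# Sketch: proximal gradient (ISTA) for min_z 0.5*||A z - b||^2 + alpha*||z||_1.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
b = rng.normal(size=100)
alpha = 0.5
t = 1.0 / np.linalg.norm(A, 2) ** 2            # step size = 1 / Lipschitz constant of grad L

z = np.zeros(50)
for _ in range(500):
    g = A.T @ (A @ z - b)                      # gradient of the smooth part L
    w = z - t * g                              # gradient step
    z = np.sign(w) * np.maximum(np.abs(w) - t * alpha, 0.0)   # prox of t*alpha*||.||_1

print(np.count_nonzero(z), "nonzero coefficients")
```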



A simple example of such a constrained problem is minimization over the unit circle:
$$\min_{x\in\mathbb{R}^2}\, f(x) \quad \text{s.t.} \quad \|x\|_2 = 1.$$
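A minimal sketch of projected gradient descent for this constraint, with a quadratic $f(x) = x^T Q x$ chosen purely for illustration ($Q$, the step size, and the iteration count are assumptions); the constrained minimizer is then the eigenvector of $Q$ with the smallest eigenvalue:

```python
# Sketch: projected gradient descent on the unit circle for f(x) = x^T Q x.
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # illustrative choice of f

x = np.array([1.0, 0.0])
for _ in range(200):
    x = x - 0.1 * (2.0 * Q @ x)            # gradient step on f
    x = x / np.linalg.norm(x)              # project back onto the unit circle

print(x @ Q @ x, np.linalg.eigvalsh(Q).min())   # objective vs. smallest eigenvalue
```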


Many of the problems discussed earlier can now be formulated as
$$\min_X\, L(X) + \alpha \|X\|_*,$$
where $\|X\|_*$ denotes the nuclear norm, i.e., the sum of the singular values of $X$.
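The proximal operator of the nuclear norm is singular-value soft-thresholding, the basic building block of proximal methods for this formulation. A minimal numpy sketch; the threshold and the test matrix are illustrative assumptions:

```python
# Sketch: singular value thresholding, the prox of tau*||.||_* evaluated at Y.
import numpy as np

def svt(Y, tau):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values by tau

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 10))
X = svt(Y, tau=3.0)
print(np.linalg.matrix_rank(Y), np.linalg.matrix_rank(X))   # thresholding typically lowers the rank
```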
Solve an optimization problem $$\min_{Z} C(Z)$$ by updating one block of the variable $Z$ at a time (block-coordinate descent / alternating minimization).
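A minimal sketch of such an alternating scheme for the factorization problem $\min_{U,V}\|X - UV^T\|_F^2$, updating $U$ with $V$ fixed and vice versa; the problem size, the rank, and the number of sweeps are illustrative assumptions:

```python
# Sketch: block-coordinate (alternating) minimization of ||X - U V^T||_F^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))   # rank-3 target (assumption)
p = 3

U = rng.normal(size=(60, p))
V = rng.normal(size=(40, p))
for _ in range(50):
    U = X @ V @ np.linalg.inv(V.T @ V)        # least-squares update of U with V fixed
    V = X.T @ U @ np.linalg.inv(U.T @ U)      # least-squares update of V with U fixed

print(np.linalg.norm(X - U @ V.T, "fro"))     # should be close to zero
```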


