## Formulating PCA

Today I have been thinking about how to formulate the Principal Component Analysis (PCA) method. In particular, I want to reformulate PCA as the solution of a regression problem. The idea of reformulating PCA as the solution of a regression problem is useful in Sparse PCA, in which an $L_1$ regularization term is added to a ridge regression formula to enforce sparseness of the coefficients (i.e. an elastic net). There are at least two equivalent ways to motivate PCA. In this post I will first give a formulation of PCA based on orthogonal projection, and then discuss a regression-type reformulation of PCA.

1. Orthogonal projection onto the optimal hyperplane:

We are given $n$ $p$-dimensional column vectors $\mathbf{y}_1,\cdots, \mathbf{y}_n$ (assume the data have been centered in pre-processing, so that $\sum_{i=1}^{n}\mathbf{y}_i = \mathbf{0}$).

Suppose that we want to project these vectors orthogonally onto a unit vector $\mathbf{a}_1$, so that the projected vectors will be:

$\mathbf{x}_i = \mathbf{a}_1\mathbf{a}_1^T\mathbf{y}_i$

If we think of the projected vectors $\mathbf{x}_i$ as approximations of the original $\mathbf{y}_i$, then the sum of the squared lengths of the discrepancies $\mathbf{y}_i - \mathbf{x}_i$ is a natural measure of how good the overall approximation is. So it is natural to adopt the following sum of squares as the objective function, and to judge $\mathbf{a}_1$ by how small it can make this objective.

$L_1=\dfrac{1}{2}\sum_{i=1}^{n}(\mathbf{y}_i - \mathbf{x}_i)^T (\mathbf{y}_i - \mathbf{x}_i)$
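As a quick numerical illustration, here is a minimal NumPy sketch (synthetic centered data; the variable names are my own) that forms the projections $\mathbf{x}_i$ and evaluates $L_1$ for an arbitrary unit vector $\mathbf{a}_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
Y = rng.normal(size=(n, p))      # rows are the y_i^T
Y -= Y.mean(axis=0)              # center so that sum_i y_i = 0

a1 = rng.normal(size=p)
a1 /= np.linalg.norm(a1)         # an arbitrary unit vector

X = Y @ np.outer(a1, a1)         # rows are x_i^T = (a1 a1^T y_i)^T
L1 = 0.5 * np.sum((Y - X) ** 2)  # (1/2) * sum_i ||y_i - x_i||^2
print(L1)
```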

Expanding $L_1$ (and using $\mathbf{a}_1^T\mathbf{a}_1 = 1$) yields

$L_1 = \dfrac{1}{2}\left(\sum_{i=1}^n\mathbf{y}_i^T\mathbf{y}_i - \mathbf{a}_1^T\sum_{i=1}^n\mathbf{y}_i\mathbf{y}_i^T\mathbf{a}_1\right)$

Let $\mathbf{\Omega} = \dfrac{1}{n}\sum_{i=1}^n\mathbf{y}_i\mathbf{y}_i^T$ (the sample covariance matrix of the centered data). Choosing $\mathbf{a}_1$ to minimize $L_1$ is then equivalent to choosing $\mathbf{a}_1$ to maximize $\mathbf{a}_1^T\mathbf{\Omega} \mathbf{a}_1$ subject to $\mathbf{a}_1^T\mathbf{a}_1 = 1$. With a change of variable, one can prove that the optimal $\mathbf{a}_1$ is the (normalized) eigenvector $\mathbf{v}_1$ of $\mathbf{\Omega}$ corresponding to the largest eigenvalue of $\mathbf{\Omega}$.
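Continuing the sketch, this claim can be checked numerically: the eigenvector of $\mathbf{\Omega}$ with the largest eigenvalue (computed here with `numpy.linalg.eigh`) attains a value of $\mathbf{a}^T\mathbf{\Omega}\mathbf{a}$ at least as large as that of any other unit vector, i.e. it makes $L_1$ smallest:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)

Omega = (Y.T @ Y) / n                     # sample covariance (1/n) sum_i y_i y_i^T
eigvals, eigvecs = np.linalg.eigh(Omega)  # eigh returns eigenvalues in ascending order
v1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

best = v1 @ Omega @ v1
for _ in range(1000):
    a = rng.normal(size=p)
    a /= np.linalg.norm(a)
    assert a @ Omega @ a <= best + 1e-12  # no unit vector beats the top eigenvector
```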

Suppose that after fixing $\mathbf{a}_1$ at $\mathbf{v}_1$, we are given another unit vector $\mathbf{a}_2$ with the property $\mathbf{a}_2 \perp \mathbf{a}_1$, and are asked how to choose the best $\mathbf{a}_2$ if the same orthogonal projection is carried out onto $\mathbf{a}_2$. By the same reasoning, the best $\mathbf{a}_2$ is the one that maximizes $\mathbf{a}_2^T\mathbf{\Omega}\mathbf{a}_2$ subject to $\mathbf{a}_2^T\mathbf{a}_2 = 1$ and $\mathbf{a}_2^T\mathbf{a}_1 = 0$. Again with a change of variable, it can be proved that the best $\mathbf{a}_2$ is the (normalized) eigenvector $\mathbf{v}_2$ of $\mathbf{\Omega}$ corresponding to the second largest eigenvalue of $\mathbf{\Omega}$.
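The same kind of numerical check works for the second direction: among unit vectors orthogonal to $\mathbf{v}_1$, none does better than $\mathbf{v}_2$ (a sketch; the candidate vectors are projected onto the orthogonal complement of $\mathbf{v}_1$ before testing):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)

Omega = (Y.T @ Y) / n
eigvals, eigvecs = np.linalg.eigh(Omega)
v1, v2 = eigvecs[:, -1], eigvecs[:, -2]   # top two eigenvectors

best = v2 @ Omega @ v2
for _ in range(1000):
    a = rng.normal(size=p)
    a -= (a @ v1) * v1                    # enforce a orthogonal to v1
    a /= np.linalg.norm(a)
    assert a @ Omega @ a <= best + 1e-12  # v2 is optimal among unit vectors orthogonal to v1
```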

A natural question to ask is whether $\mathbf{v}_1$ and $\mathbf{v}_2$ are still the best if we consider orthogonal projection onto the plane spanned by two orthogonal unit vectors $\mathbf{a}_1$, $\mathbf{a}_2$.

In other words, we consider the following objective function

$L_{plane} = \dfrac{1}{2}\sum_{i=1}^n(\mathbf{y}_i - \mathbf{a}_1\mathbf{a}_1^T\mathbf{y}_i - \mathbf{a}_2\mathbf{a}_2^T\mathbf{y}_i)^T(\mathbf{y}_i - \mathbf{a}_1\mathbf{a}_1^T\mathbf{y}_i - \mathbf{a}_2\mathbf{a}_2^T\mathbf{y}_i)$

and ask whether the following statement holds:

$\mathbf{v}_1,\mathbf{v}_2 = \text{argmin}_{\mathbf{a}_1,\mathbf{a}_2} L_{\text{plane}}$ subject to $\mathbf{a}_1^T\mathbf{a}_1 = \mathbf{a}_{2}^T\mathbf{a}_2 = 1$ and $\mathbf{a}_1^T\mathbf{a}_2 = 0$

By the Pythagorean theorem, this is indeed true. So in fact the plane spanned by $\mathbf{v}_1,\mathbf{v}_2$ is the optimal plane among all $2$-dimensional planes.
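To spell out the Pythagoras argument: since $\mathbf{a}_1 \perp \mathbf{a}_2$ and both are unit vectors, the cross terms vanish when the quadratic is expanded, and

$L_{\text{plane}} = \dfrac{1}{2}\sum_{i=1}^n\mathbf{y}_i^T\mathbf{y}_i - \dfrac{1}{2}\left(\mathbf{a}_1^T\sum_{i=1}^n\mathbf{y}_i\mathbf{y}_i^T\mathbf{a}_1 + \mathbf{a}_2^T\sum_{i=1}^n\mathbf{y}_i\mathbf{y}_i^T\mathbf{a}_2\right)$

so minimizing $L_{\text{plane}}$ is the same as maximizing $\mathbf{a}_1^T\mathbf{\Omega}\mathbf{a}_1 + \mathbf{a}_2^T\mathbf{\Omega}\mathbf{a}_2$, and this sum is maximized by taking $\mathbf{a}_1 = \mathbf{v}_1$, $\mathbf{a}_2 = \mathbf{v}_2$.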

This property can be generalized to the case of a $q$-dimensional hyperplane ($q \le p$). The optimal (in the orthogonal projection sense) hyperplane in this case is spanned by the first $q$ eigenvectors $\mathbf{v}_1,\cdots,\mathbf{v}_q$ of $\mathbf{\Omega}$ (in the order of descending eigenvalues). If we define $\mathbf{V}_q = [\mathbf{v}_1 \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_q]$, then the projection matrix onto the optimal hyperplane is $\mathbf{V}_{q} \mathbf{V}_q^T$. (One can prove this generalized property by considering an arbitrary orthonormal basis $\mathbf{a}_1,\cdots,\mathbf{a}_q$ and the corresponding projection matrix $\mathbf{A}\mathbf{A}^T$ with $\mathbf{A}$ defined as $\mathbf{A} = [\mathbf{a}_1\ \mathbf{a}_2 \ \cdots \ \mathbf{a}_q]$, then expanding the objective function $L_{\text{plane}}$ under the condition $\mathbf{A}^T\mathbf{A} = \mathbf{I}$.)
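As a sketch of the $q$-dimensional case (same synthetic setup as before), the projection matrix $\mathbf{V}_q\mathbf{V}_q^T$ built from the top $q$ eigenvectors gives a reconstruction error no larger than that of a randomly chosen orthonormal basis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 2
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)

Omega = (Y.T @ Y) / n
eigvals, eigvecs = np.linalg.eigh(Omega)
Vq = eigvecs[:, ::-1][:, :q]                  # top-q eigenvectors, descending eigenvalues
P_opt = Vq @ Vq.T                             # projection matrix onto the optimal hyperplane

A, _ = np.linalg.qr(rng.normal(size=(p, q)))  # a random orthonormal basis, for comparison
err_opt = 0.5 * np.sum((Y - Y @ P_opt) ** 2)
err_rand = 0.5 * np.sum((Y - Y @ (A @ A.T)) ** 2)
print(err_opt, err_rand)                      # err_opt <= err_rand
```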

The $\mathbf{x}_i = \mathbf{V}_q\mathbf{V}_q^T\mathbf{y}_i$ are $p$-dimensional vectors, but are completely confined to the $q$-dimensional subspace spanned by $\mathbf{v}_1,\cdots,\mathbf{v}_q$. In other words, the “effective” dimension of $\mathbf{x}$ in this case is only $q$. For this reason, in a typical Principal Component Analysis, one is not interested in the $\mathbf{x}_i$ themselves, but in the coordinates of $\mathbf{x}_i$ in the new coordinate system defined by the orthonormal basis $\mathbf{v}_1,\cdots, \mathbf{v}_q$. The coordinates of $\mathbf{x}_i$ in this system are $\mathbf{v}_1^T\mathbf{y}_i, \cdots, \mathbf{v}_q^T\mathbf{y}_i$. Therefore, further analyses after PCA often work directly with the $q$-dimensional column vector $\mathbf{z}_i$ defined as

$\mathbf{z}_i = (\mathbf{v}_1^T\mathbf{y}_i, \cdots,\mathbf{v}_q^T\mathbf{y}_i)^T$

On a side note, the $\mathbf{z}_i$ have a diagonal sample covariance matrix (i.e. the dimensions of $\mathbf{z}$ are uncorrelated). Since $\mathbf{y}$ has (sample) covariance matrix $\mathbf{\Omega}$ and $\mathbf{z}$ is the linear transformation $\mathbf{z} = \mathbf{V}_q^T\mathbf{y}$, the covariance matrix of $\mathbf{z}$ is $\mathbf{V}_q^T\mathbf{\Omega}\mathbf{V}_q$, which is diagonal because $\mathbf{\Omega}\mathbf{v}_j = \lambda_j\mathbf{v}_j$ and the $\mathbf{v}_j$ are orthonormal; its diagonal entries are the $q$ largest eigenvalues $\lambda_1,\cdots,\lambda_q$ of $\mathbf{\Omega}$.
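A quick numerical check of this fact (same synthetic setup): the sample covariance of the scores $\mathbf{z}_i = \mathbf{V}_q^T\mathbf{y}_i$ comes out diagonal, with the top $q$ eigenvalues of $\mathbf{\Omega}$ on the diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 2
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)

Omega = (Y.T @ Y) / n
eigvals, eigvecs = np.linalg.eigh(Omega)
Vq = eigvecs[:, ::-1][:, :q]      # top-q eigenvectors of the sample covariance

Z = Y @ Vq                        # rows are z_i^T = (V_q^T y_i)^T
cov_Z = (Z.T @ Z) / n             # sample covariance of the scores
print(np.round(cov_Z, 6))         # off-diagonal entries are (numerically) zero
print(eigvals[::-1][:q])          # the diagonal matches the top-q eigenvalues
```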

2. PCA as a solution of a ridge regression problem:

[the need for sparse PCA here]

The above formulation of PCA, based on orthogonal projection onto optimal hyperplanes, is actually a ridge regression formulation, and the final reformulation for Sparse PCA will be very close to it.

[sparse PCA ]