## Dimension Reduction

Before we discuss one of the most popular dimension reduction techniques (PCA), it helps to know why dimension reduction is used and which other techniques are available in the data science arsenal.

## Principal Component Analysis

### Intuition

Let’s assume that you are a coin collector who, aware of the value of rare coins, is persistently on the lookout for them. One day, you get lost in a forest during a treasure hunt. After hours of searching, voilà! There it is: the treasure chest.

You notice there are about 100,000 coins, each slightly different. How do you know which ones are rare?

Many features can describe these coins: weight, color, shape, size, year of mint, material used, and so on.
Overwhelmed by the complexity of exploring so many variables at once, you find PCA a promising option.

Some of these measures may capture similar characteristics, so PCA can highlight the key characteristics of the coins with a smaller set of dimensions. That is, PCA looks for a way of transforming the existing variables into new dimensions that help distinguish between coins. These dimensions are constructed such that when the original variables are reconstructed from them, the error is minimal.

Now let’s say, as an example, that the material of a coin depends on its year of mint, that is, there is some correlation between the two. Such relationships may exist among many of the features. With too many features and too much noise, we need a way to bring the number of variables down.

A new property can be constructed as a linear combination

$w_1 x + w_2 y,$

for particular values of $w_1$ and $w_2$ that minimize the reconstruction error, that is, they capture as much of the variance of $x$ and $y$ as possible.

That is how we are able to reduce the dimensions.
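To make this concrete, here is a small sketch (using NumPy and made-up correlated data, not anything from the original post) of how the weights $w_1, w_2$ turn out to be the leading eigenvector of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: y is roughly 2*x plus noise, so the variables correlate.
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

# Centre the data and compute the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# The weights (w1, w2) that minimise reconstruction error are the
# leading eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)
w = eigvecs[:, np.argmax(eigvals)]   # unit-length weight vector (w1, w2)

# The new one-dimensional property: w1*x + w2*y for every coin/point.
new_property = centered @ w
```

The variance of `new_property` equals the largest eigenvalue, which is exactly the "explain as much variance as possible" property described above.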

### Let’s talk some Math!

There are a few concepts we need to cover to truly understand the inner workings of PCA. Let us say we have a vector $\textbf{X}$.

$\textbf{X} = \left(\begin{array}{c} X_1\\ X_2\\ \vdots \\X_p\end{array}\right)$

With the Variance-Covariance matrix:

$\text{var}(\textbf{X}) = \Sigma = \left(\begin{array}{cccc}\sigma^2_1 & \sigma_{12} & \dots &\sigma_{1p}\\ \sigma_{21} & \sigma^2_2 & \dots &\sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma^2_p\end{array}\right)$

Consider the linear transformations:

$\begin{array}{lll} Y_1 & = & e_{11}X_1 + e_{12}X_2 + \dots + e_{1p}X_p \\ Y_2 & = & e_{21}X_1 + e_{22}X_2 + \dots + e_{2p}X_p \\ & & \vdots \\ Y_p & = & e_{p1}X_1 + e_{p2}X_2 + \dots +e_{pp}X_p\end{array}$

These can be looked at (for the sake of understanding) like regression equations that transform the existing variables into new dimensions, where $e_{i1}, e_{i2}, \dots, e_{ip}$ play the role of the regression coefficients.

In other words, the first principal component maximizes the variance:

$\text{Max var}(Y_1) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl} = \mathbf{e}’_1\Sigma\mathbf{e}_1$

subject to:

$\mathbf{e}’_1\mathbf{e}_1 = \sum_{j=1}^{p}e^2_{1j} = 1$
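This constrained maximization can be checked numerically: among unit-length vectors, none attains a higher variance $\mathbf{e}'\Sigma\mathbf{e}$ than the top eigenvector. A small sketch with a made-up covariance matrix (not from the original post):

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up symmetric positive-definite covariance matrix Σ.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# e1 is the eigenvector paired with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
e1 = eigvecs[:, -1]
max_var = e1 @ Sigma @ e1                  # e1' Σ e1, the maximised variance

# Sample many random unit vectors; none should beat e1' Σ e1.
candidates = rng.normal(size=(1000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
random_vars = np.einsum('ij,jk,ik->i', candidates, Sigma, candidates)
```

Every `random_vars` value stays at or below `max_var`, illustrating that the top eigenvector solves the maximization subject to $\mathbf{e}'_1\mathbf{e}_1 = 1$.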

### The Twist in the Story

However, the coefficients are not estimated with gradient descent or ordinary least squares; that is, these are not solved like regression equations. Instead, they are obtained from the eigenvalues and eigenvectors of the variance-covariance matrix.

To understand how eigenvectors work, consider the following:
An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. To make this easier to understand, take the example below, where $(1, 2)$ is an eigenvector and $5$ the corresponding eigenvalue:

$A v = \begin{pmatrix} 1 & 2 \\ 8 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 5 \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \lambda v.$

Hence $A$ merely scales $v$, so that

$A v = \lambda v.$
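The worked example above can be verified in a couple of lines of NumPy:

```python
import numpy as np

A = np.array([[1, 2],
              [8, 1]])
v = np.array([1, 2])

# Applying A to v only scales it: A v = (5, 10) = 5 * v,
# so v is an eigenvector of A with eigenvalue 5.
result = A @ v
```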

To get back to PCA, the variance for the ith principal component is equal to the ith eigenvalue.

$\textbf{var}(Y_i) = \text{var}(e_{i1}X_1 + e_{i2}X_2 + \dots + e_{ip}X_p) = \lambda_i$

where $\lambda_1$ through $\lambda_p$ denote the eigenvalues of the variance-covariance matrix, and the principal components are uncorrelated with one another, that is,

$\text{cov}(Y_i, Y_j) = 0 \quad \text{for } i \neq j$

To summarize, the variance-covariance matrix can be represented as

$\Sigma = \sum_{i=1}^{p}\lambda_i\mathbf{e}_i\mathbf{e}_i’$

(the Spectral Decomposition Theorem).
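A quick numerical check of the spectral decomposition, using a made-up 2×2 covariance matrix (an illustration, not data from the post):

```python
import numpy as np

# A made-up variance-covariance matrix.
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)

# Rebuild Σ as the sum of λ_i * e_i e_i' over all eigenpairs.
reconstruction = sum(
    lam * np.outer(e, e)
    for lam, e in zip(eigvals, eigvecs.T)
)
```

The reconstruction matches `Sigma` exactly (up to floating-point error), confirming the theorem for this example.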

Let’s take stock of what we have learnt so far by considering the following example:

The underlying idea of PCA is to reduce dimensions in such a way that we are able to explain as much variation in the data as possible. Further, we use eigenvectors to find these components which are shown in the image below.

To add more intuition: a principal component is essentially a direction in the feature space, represented as a vector of coordinates. Now take a closer look at the image.

#### The Components

If I were standing at the center of the data (roughly the point (1, 3)), then to cover the most variation I would move toward the right (roughly the point (4, 5)). This is my first component: it points in the direction of highest variation, and the length of the arrow represents the amount of variation.

In order to put together my second component, I follow the same protocol on the variation that remains. The second principal component thus comes out orthogonal to the first.

Two vectors are orthogonal if their dot product is zero.

Now, the rest is pretty simple: I want to reduce this 2-D data to a single dimension, so I pick the largest principal component and project all the data onto that component/line.

> Keep in mind that the *n*th principal component will be orthogonal to all of the preceding *n* − 1 principal components.
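The 2-D-to-1-D projection described above can be sketched as follows (NumPy, with made-up correlated data standing in for the image):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up, strongly correlated 2-D data, centred at the origin.
x = rng.normal(size=300)
data = np.column_stack([x, 1.5 * x + rng.normal(scale=0.3, size=300)])
data -= data.mean(axis=0)

# The first principal component is the top eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, -1]                       # eigh sorts eigenvalues ascending

# Project every 2-D point onto the line spanned by pc1: one number per point.
scores = data @ pc1                        # the reduced, 1-D data
on_line = np.outer(scores, pc1)            # the projected points, back in 2-D
residual = ((data - on_line) ** 2).sum()   # what the projection throws away
```

Projecting onto the first component loses far less information than projecting onto any other direction, which is exactly why we pick the largest component.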

### To Close!

PCA is a very popular technique for handling dimensionality issues, and it takes care of multicollinearity as well. It does have limitations: PCA looks for a global structure, and in the process it may introduce inconsistencies in the local structure of the data. It also uses only linear transformations to form the components, even though a non-linear transformation may be better suited to some variables.

Despite such issues, it is a well-known and well-tested algorithm.
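As a closing sketch (pure NumPy, with made-up data that has multicollinearity built in), the eigenvalues also tell us how many components are worth keeping, via the proportion of variance each one explains:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up 5-D data driven by only two latent factors (built-in multicollinearity).
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
data = factors @ mixing + rng.normal(scale=0.1, size=(500, 5))
data -= data.mean(axis=0)

# Eigenvalues of the covariance matrix = variances of the principal components.
eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]  # descending
explained_ratio = eigvals / eigvals.sum()

# Keep just enough components to explain 95% of the total variance.
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95)) + 1
```

Because only two latent factors generate the five variables, the first two ratios dominate and `k` comes out small, which is the multicollinearity-handling behaviour described above.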