What is Dimension Reduction?
More importantly why do I need it?
Before we get into the different types of dimension reduction techniques, let’s build some understanding regarding the need.
I was working on a Kaggle data set which had over 4K dimensions. Large number of dimensions adds to the complexity, the computational load in the predictive tasks and basic transformations. Visual intuition is out of the grasp at this point.
Similarly, think of datasets with variables having several factors which in classification problems are treated as separate variables. Deep learning tasks might involve an even larger set of variables such as the use case of image classification.
Of course, the most basic steps such as removing independent variables with very high correlation and the variables with near zero variance (example: a variable with a smaller spread in values and mostly concentrated at a single value), do help but only to a small extent.
What can be done? Enter Dimension Reduction Techniques
Certain Statistical techniques that allow us to transform variables to a lower dimensional space without much loss in information. We wish to find the Latent features in the data or features that provide useful information. To start with let us understand some basic techniques that help us thin the herd and then develop the intuition on more advanced techniques:
In case a variable has too many missing values making it an unlikely candidate for missing value imputation. We can drop such a variable.
Very low variance that is it has a very high concentration of observations with the same value. For example if we have a numeric variable with 99% value of 100 and remaining 1% have values in range of 110-120, it is a variable that does not help your model learn much.
If there are two variables with correlation say 0.99, we can drop one of these variables. Now, there are some techniques mentioned later on, that will take care of these type of variables during their process making this step necessary only in certain conditions for example if there is a computational restraint in some of the other techniques, then using this and following it with those techniques will help a little.
This technique might still be heavy computationally if the purpose is to solely reduce the dimensions, but it involves fitting a model such as Random forest and then evaluating which are the most important variables to the model. This process is synonymous to forward/backward feature selection in Regression.
This is a technique of grouping correlated variables to reduce the dimensional space.
If there are say two sets of variables: X1,X2,X3 are a correlated set and Y1,Y2 are the other correlated set.
Here, Y1 has no correlation with X1 that is, any variable of set A will have low correlation with variable of Set B. However, high correlation if the variables were of the same set. In such a case this problem could be reduced a set of two latent variables.
The idea behind Factor analysis is to find the representative variable from a set of similar variables.
X1 measures a Car's fuel efficiency & X2 measures the engine power. Here X1,X2 are trying to measure how well the Car runs. Y1,Y2 represent the interiors of the car, the leg space and so on. Comfort is the underlying representative variable being measured via Y1,Y2.
Principal Component Analysis:
PCA is a novel way to explain the data. Using transformation on the existing set of variables it reduces the dimensions. These transformations are carried out with the prime objective that the variation present in the data is explained and if we were to reconstruct the old variables from these new variables the error is minimum. We will explore this technique further here.
Latent Dirichlet Allocation:
LDA more popular for topic modeling, but can also be used for dimension reduction. Statistically, it assumes data points to be of two separate multivariate normal distributions with different means but the same co-variance matrix. Further, a hyper-plane that separates these two data points will be computed.
Unlike PCA, the idea here is not to minimize the reconstruction error, but to maximize the chance of separation of two classes. Hence, LDA is a supervised dimension reduction technique that finds a subspace that separates the classes as much as possible.
On a 2 D Graph, if a line is shift parallel to itself such that the entire space can be filled up, then it is a hyper-plane. A hyper-plane is a subspace whose dimension is one less than that of its ambient space. In different settings, the objects which are hyper-planes may have different properties.
Non-negative matrix factorization (NMF) is a method for discovering low dimensional
representations of non-negative data. It tries to find two non-negative matrices (W, H) whose product approximates the non-negative matrix X by matrix factorization. Generalized Low Rank Models can also be used for dimension reduction.
Let’s say we take a data point on a graph and draw a connecting edge to it’s closest points. We provide weights to these edges based on the similarity of points. This map of connected points is treated as our objective function which we are trying to minimize to a low dimensional space.
The locality-preserving character of the Laplacian eigenmap algorithm makes it relatively insensitive to outliers and noise. Laplacian Eigenmaps is thus simply trying to connect similar points and represent them on a lower dimensional space.
Multidimensional scaling (MDS) is a form of non-linear dimensionality reduction. It involves a matrix of pair-wise distances between all points, and computes a position for the point.
MDS, tries to preserve the euclidean distances while reducing the dimensions. In Isomap, we calculate the geodesic distance instead of euclidean distance in MDS. This method is computationally intensive however, at the time of release it was hailed as a major development in the field.
What is geodesic distance?
The distance between two vertices in a graph
is the number of edges in a shortest path.
If we wish to measure the distance between poles of the Earth,
Consider a series of points connected to each other,
to form the shortest path between the poles
as the geodesic distance.
Why was this used?
Geodesic distance is more effective
in capturing the structure of the manifold
as compared to euclidean distance.
Locally Linear Embedding:
A non-linear unsupervised technique that computes a new co-ordinate for each data point on a low dimensional manifold. We first look for the neighbors of a data point and then assume that if our selected point was not available, can we recover it from the identified neighbors. Thus, we have weights assigned that allow us to reconstruct our selected point. Finally, we map this point to a lower dimensional space while preserving these weights.
T-distributed stochastic neighbor embedding (t-SNE) is a neighborhood preserving embedding. While preserving the local structure of the manifold and mapping it to a low dimensional space, t-SNE also tries to preserve geometry at all scales.
To put it plainly, when we shift from a high dimensional space to a low dimensional space, it ensures points that were close still are and points far-apart remain so. It calculates the probability of the similarity of the points in high dimensional space as well as in low dimensional space, and minimizes the difference between them.
A new technique that involves manifold learning and is computationally faster than t-SNE. It tries to preserve the local and global structure and map it onto a low dimensional space using k-nearest neighbor and some concepts of topology. We will further discuss this technique here.
An autoencoder are a family of methods. It is a neural network whose primary purpose is to learn the underlying manifold or the feature space in the data set. In other words, it tries to encode (transform input data) in a hidden neural net layer and then decode it to get back as close to the input values as possible. The assumption here is that the transformations resulting in the hidden layer represent the properties of the data that are of value. We will further discuss this technique in a future post.
Consider, the above image as an example of implementing some of these techniques on the IRIS data set with the axes consisting of the components obtained from application of the respective techniques (color is with respect to the target variable). Which algorithm is right for you depends on the data set.
There are several dimension reduction techniques available however,here we have only explored some techniques that are currently prevalent in the industry or are truly novel methods.
Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
About the writers:
- Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.