Before we discuss this new and exciting development in dimension reduction techniques (UMAP), we need to know what are the reasons to use dimension reduction and the different techniques available in the data science arsenal. Please have a look at it here.
Uniform Manifold Approximation and Projection (UMAP)
To give you an idea of UMAP in action (MNIST dataset where we are classifying the images of digits 0-9). Note, how these clusters (for explanation purposes let’s call these concentration of points a cluster) are formed in this image. UMAP tries to preserve these local structures (within the cluster) and ensure some separation between these clusters.
Firstly, I have to mention Leland McInnes and John Healy for developing this novel manifold learning technique for dimension reduction. If you want to reference their paper you can do the same here. This technique can be used for visualization similar to t-SNE, but also for general non-linear dimension reduction. t-SNE works very well but it has limitations such as loss of large-scale information and being computationally intensive.
Let’s start with some basics and slowly develop this concept and along the way develop understanding around UMAP. We can say that there are two main steps we want to accomplish here:
- Find a way to represent the data without much loss in information regarding how the data points are located.
- Find a close enough representation in the lower dimensional space of the structure we captured in the previous step.
An n-dimensional manifold is a space that locally looks like n-dimensional Euclidean space but has a different global structure. To put it more plainly, consider a sheet of paper, if you take any point on this sheet it is easy to draw a small straight line. Now, we can take the paper and fold it to form a sphere. Now, given a spherical surface the line is curved.
Here, the straight line represents a euclidean space, and if n such lines can be drawn on this paper from n points then they will be an n-dimensional euclidean space. The sphere is the global structure which is not euclidean.
Let’s get acquainted with some ideas of Topology. Topology may sound scary but it is simply the location of a data point with regards to other points of a set. In general, a space is just a set of points and if we associate the notion of the distance between points, we call these metric spaces. Topological spaces are just generalizations of metric spaces.
Further, a cover Y of a topological space X is essentially a space that contains all points of X and some surrounding points in X’s neighborhood. If it sounds confusing consider this example, imagine you are looking from the top at a man holding an umbrella, if the man is the data point then the umbrella is the respective cover. The umbrella is covering the man and some surrounding area as well.
Simplex in topology
In geometry, a simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. A zero-simplex is simply a vertex, a one-simplex is an edge and so on we can form a k point simplex. Thus, combining data points to form these structures as shown below:
In algebraic topology, simplices are used as building blocks to construct an interesting class of topological spaces called simplicial complexes. These spaces are built from simplices glued together in a combinatorial fashion.
Consider that we have a data set represented in the graph below. We have placed a cover around each data point representing their neighborhood.
This data can be represented in the form of Simplicial Complexes that is by using 0-simplices and 1-simplices such as:
A neighborhood graph based approach should capture manifold structure when doing dimension reduction. The idea in UMAP is to find a topological representation of the data in a lower dimensional space. It is like an artist is trying to paint a portrait of a person. The artist wants to show a true representation of the subject’s features while converting the 3D into a 2D representation.
Assumptions and Practicality
The algorithm is founded on three assumptions about the data
- The data is uniformly distributed on Riemannian manifold;
- The Riemannian metric is locally constant (or can be approximated)
- The manifold is locally connected.
In practical application, when is data ever that clean. One question that comes to mind is that are we loosing any information by forming this representation.
Further, what should be the size of the neighborhood (cover) of a data point such that we do not loose the topological information. Another point that is evident in the image above, what if the data points are concentrated at a specific location and at some locations there are no points at all. Also, for a concentration of points, when building simplicial complexes, which data points are to be connected.
Too many questions, Let’s try to answer some of these.
In an ideal world with a uniformaly distributed data, the distance between two points would be a perfect radii to form the cover. But here we can consider varying distances based on the data point and it’s respective neighborhood.
simply put states that defining a simplicial complex is equivalent (homotopically) to a union of a cover thus the topological information is not lost.
A Riemannian metric
defined on this manifold essentially means we are trying to make sense of the angles and distances between data points. Defining a Riemannian metric is necessary as we are trying to preserve the topology over here.
To build simplicial complexes we need to consider the k nearest neighbors to a data point and these can be connected via an edge. But, what will be an appropriate k?
A small k implies attention to the fine structure of the topology while a large k implies that we want to estimate based on larger regions at the cost of the finer structure. The right k depends on the distribution of distances in the data set.
Let’s take an example:
Now, if a data point has 5 neighbors, instead of having a binary answer (Yes or No) to whether they are to be connected, we can have a fuzzy answer (a value between 0 and 1). Consider, weights for each connective edge based on the distance. The weights will help us form the simplicial complexes.
Interpret the weights as the probability of the simplex existing
The assumption of local connectivity comes into picture, that is any point on the manifold there is some sufficiently small neighborhood of the point that is connected. Thus, there should be no isolated points. Of course, if the data set has some points that are very widely spread, while connecting them our confidence in the estimations we make will be low. But it is necessary to have the assumption of local connectivity as it also help tackle the curse of dimensionality.
The distance to the first nearest neighbor can be quite large, but the distance to the tenth nearest neighbor can often be only slightly larger (in relative terms).
The local connectivity constraint ensures that we focus on the difference in distances among nearest neighbors rather than the absolute distance (which shows little differentiation among neighbors).
As these data points do not lie on a 2D graph, 2 data points can have conflicting distances to each other. That is, there can be multiple edges connecting these points. From point a’s perspective the distance from point a to point b might be 1.5, but from the perspective of point b the distance from point b to point a might only be 0.6.
To resolve such cases, we combine the weights such as: a + b – a*b
So far we have only found a way to represent the data without loosing much of the topological structure. We still need to embed this information on to a low dimensional space.
- We explicitly want the distance on the manifold to be standard euclidean distance with respect to the global coordinate system, not a varying metric. Distance to the nearest neighbor should be a property that is globally true across the manifold as we optimize toward a good low dimensional representation. In other words, look at the image of UMAP on MNIST data, the local and global structure should be preserved in the low dimensional space if it exists in the high dimensional space.
- Ultimately, while resolving conflicting weights, when we conclude if an edge will exist between two points or not, the probability is the parameter of a Bernoulli distribution. This leads us to cross-entropy which we want to minimize.
The numerator in log function here is the weight function in the high dimensional space while the denominator is the weight function in the low dimensional space. This part provides an attractive force between the points, in other words we want to preserve the structure of the data points especially if the data points are clumped together. This is because this term will be minimized when the denominator is as large as possible, which will occur when the distance between the points is as small as possible.
This is like a repulsive force, as the term will be minimized by making the weight function of the low dimensional space as small as possible. In other words, we want to ensure that the gaps or separation in the global structure right.
The right balance between these two components is what we are looking for.
To get there faster that is finding the right low dimensional topological space that is a close representation of our high dimensional data, we can use:
- RP-trees and NN-descent algorithms to find a good topological space.
- Stochastic Gradient Descent and negative sampling for the optimization process of the local layout.
In summary, I know that UMAP might look very complex but it is in fact quite an interesting technique. Further, in recent months UMAP has gained a lot of buzz and for good reason, it gives good results much faster than T-SNE.
Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
About the writers:
Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.