t-SNE Dimensionality Reduction Algorithm

Nazrul Miya
5 min read · Jun 18, 2021

In a single sentence: t-SNE (t-distributed Stochastic Neighbour Embedding) is a dimensionality reduction technique that reduces the number of dimensions in a dataset by mapping data points from a high-dimensional space to locations in a lower-dimensional space.
Hence the t-SNE algorithm can help solve two real problems that data scientists face in a project:

  1. Reducing the dimensionality of a large dataset, which helps to improve a model's performance in a cost-effective manner.
  2. Visualizing high-dimensional data, by projecting it into 2-D or 3-D space.

Let's decode what the words Neighbourhood, Embedding, Stochastic and t-Distributed mean.

Neighbourhood:
The neighbourhood of a data point in an N-dimensional space is the set of data points nearest to it.

Embedding:
The mapping of a point from one vector space to another vector space is called an embedding.

Stochastic:
The t-SNE algorithm takes a probabilistic approach to embedding data points: it uses probabilities of similarity between data points to construct the embedding.

t-Distributed:
It is a type of probability distribution, similar to the bell-shaped normal distribution except that its tails are heavier. This type of distribution arises when estimating the mean of a normally distributed population from a small sample whose population standard deviation is unknown.
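
For example, with one degree of freedom (the case used later in t-SNE's low-dimensional map), the t-distribution reduces to the Cauchy distribution, whose density

f(t) = 1 / (π (1 + t²))

decays much more slowly in the tails than the Gaussian density.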

High-Level Intuition

The t-SNE algorithm maps high-dimensional data objects into a much lower-dimensional space such that the mapping preserves both the local and the global structure of the high-dimensional data as much as possible.

Distances between data points within a cluster represent local geometric structure. Preserving this structure means that, after the data objects are mapped into the lower-dimensional space, the distances between within-cluster elements do not change.

Distances between data points from two different clusters represent global geometric structure.

Put simply, the algorithm involves the following steps.

1. Measure similarities between data points in the high-dimensional space (a probability measure).

2. Measure similarities between the same data points after they are mapped to the lower-dimensional space (a probability measure).

3. Compare these two sets of similarity measurements.

If the data points from the higher-dimensional space are correctly modelled in the lower-dimensional space, so that the local and global structure are preserved, then these similarity measurements will be equal.

Low-Level Intuition

Step 1:

The Euclidean distance between two high-dimensional data points x_i and x_j is computed as

d(x_i, x_j) = ||x_i − x_j||

From it, the t-SNE algorithm calculates a scaled squared Euclidean distance d_{ij}² between the two points as

d_{ij}² = ||x_i − x_j||² / (2σ_i²)

where σ_i is the variance of the normal distribution of data points centred at point x_i.

Step 2:

A conditional probability p_{j|i} is then calculated from the scaled squared distances between points in the high-dimensional space. It measures the probability that data point x_i would pick data point x_j as its neighbour:

p_{j|i} = exp(−d_{ij}²) / Σ_{k≠i} exp(−d_{ik}²)

where the sum in the denominator runs over all points k other than i. The variance σ_i is chosen so that this distribution has a user-specified perplexity; perplexity is an input to the algorithm and can be interpreted as a smooth measure of the effective number of local neighbours.

Step 3:

Similarly, in the low-dimensional space, a conditional probability q_{j|i} that a mapped data point y_i will pick another data point y_j as its neighbour is computed as

q_{j|i} = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠i} (1 + ||y_i − y_k||²)^(−1)

In the low-dimensional space, no per-point variance needs to be tuned: a Student t-distribution with one degree of freedom, centred at point y_i, is used instead of a Gaussian. (The earlier SNE algorithm used a Gaussian with a fixed variance of 1/2 here; replacing it with the heavy-tailed t-distribution is what puts the "t" in t-SNE and helps well-separated clusters stay apart in the map.)

Step 4:

The conditional probabilities p_{j|i} and q_{j|i} over all data points define two probability distributions, P and Q. The t-SNE algorithm tries to match these two distributions as closely as possible. To compare them, it uses the Kullback–Leibler divergence, a measure of how much one probability distribution differs from another, computed as

D_{KL}(P || Q) = Σ_i Σ_j p_{j|i} log( p_{j|i} / q_{j|i} )

The t-SNE algorithm tries to minimize D_{KL}(P || Q); this is the cost function of the t-SNE algorithm.
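
To make the objective concrete, here is a minimal NumPy sketch that computes p_{j|i}, q_{j|i} and the KL divergence for a toy dataset. It is illustrative only: the bandwidths σ_i are fixed at 1 instead of being tuned to a target perplexity, and the candidate embedding Y is random rather than optimized.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))  # 6 points in 10-D (high-dimensional space)
Y = rng.normal(size=(6, 2))   # a candidate 2-D embedding
sigma = np.ones(len(X))       # fixed bandwidths; real t-SNE tunes one per point

def cond_probs_high(X, sigma):
    """p_{j|i}: Gaussian similarities in the high-dimensional space."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    P = np.exp(-d2 / (2 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)  # a point never picks itself as a neighbour
    return P / P.sum(axis=1, keepdims=True)

def cond_probs_low(Y):
    """q_{j|i}: Student-t (1 degree of freedom) similarities in the map."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

P, Q = cond_probs_high(X, sigma), cond_probs_low(Y)
mask = P > 0  # skip the zero diagonal to avoid log(0/0)
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(f"KL(P || Q) = {kl:.4f}")  # the quantity t-SNE's gradient descent minimizes
```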

How to do it in Python?

The scikit-learn class sklearn.manifold.TSNE implements t-distributed stochastic neighbour embedding.

We will perform t-SNE on the KDD 2009 dataset (a smaller version of it), which has 230 features: 190 numerical and 40 categorical variables. Before we run t-SNE on this 230-dimensional dataset, we will execute the steps below.

1. Filter only the numerical features. (Note: categorical features need to be encoded into numerical form before they can be fed to the t-SNE algorithm; for this example we ignore the categorical features.)

2. Missing-value filter: features with more than 90% missing values will be removed.

3. Missing-value treatment: missing values in each numerical feature will be filled with that feature's mean value.

4. Standardization: features will be standardized to bring them all into a common range.

Finally, we will reduce the dimensionality with t-SNE; a sketch of the whole pipeline follows.
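
Here is a minimal sketch of that pipeline, assuming the small KDD 2009 training data has been downloaded locally as a tab-separated file named orange_small_train.data (the file name and read options are assumptions; adjust them to your copy of the data):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load the (small) KDD 2009 dataset; file name and separator are assumptions.
df = pd.read_csv("orange_small_train.data", sep="\t")

# 1. Keep only the numerical features (categorical ones are ignored here).
num = df.select_dtypes(include="number")

# 2. Missing-value filter: drop features that are more than 90% missing.
num = num.loc[:, num.isna().mean() <= 0.9]

# 3. Missing-value treatment: fill remaining gaps with each column's mean.
num = num.fillna(num.mean())

# 4. Standardization: rescale every feature to zero mean and unit variance.
X = StandardScaler().fit_transform(num)

# Reduce to 2 dimensions with t-SNE.
# (t-SNE on tens of thousands of rows can take a while.)
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
X_2d = tsne.fit_transform(X)

# 2-D visualisation of the embedding.
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.title("t-SNE projection of the KDD 2009 numerical features")
plt.show()
```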

2-D Visualisation

The t-SNE algorithm can be run multiple times with various combinations of parameters such as perplexity, n_iter and learning_rate.

t-SNE reports a different result on each run because its cost function is non-convex and is optimized with gradient descent from a random initialization, so each run can end at a different local optimum with a different KL divergence. To select the right visualisation of a high-dimensional dataset, run t-SNE multiple times with the same perplexity, the same data and the same other parameters, and select the visualisation for which t-SNE reports the lowest KL divergence.
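
A sketch of that selection loop, reusing the standardized matrix X from the previous snippet (scikit-learn exposes the final cost of a fitted model as the kl_divergence_ attribute):

```python
import numpy as np
from sklearn.manifold import TSNE

best_embedding, best_kl = None, np.inf
for seed in range(5):  # several runs: same data, same hyperparameters
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
    embedding = tsne.fit_transform(X)
    if tsne.kl_divergence_ < best_kl:  # keep the run with the lowest final KL divergence
        best_kl, best_embedding = tsne.kl_divergence_, embedding

print(f"Best (lowest) KL divergence: {best_kl:.4f}")
```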

Check the FAQ at https://lvdmaaten.github.io/tsne/ for useful information on t-SNE.

t-SNE can be computationally expensive for large datasets, and finding the right combination of hyperparameters is tricky.

t-SNE can also report misleading results. Check out the excellent article https://distill.pub/2016/misread-tsne/ for more clarity.
