t-SNE from Scratch (ft. NumPy)

Machine Learning • Manifold Learning

t-SNE from Scratch (ft. NumPy)

Acquire a deep understanding of the inner workings of t-SNE via implementation from scratch in python.

KL DivergencePerplexityNeighbor Embedding

Key Takeaways

t-SNE preserves local neighborhood structure by matching pairwise affinities.

Perplexity controls effective neighborhood size via σ selection.

Optimization minimizes KL(P∥Q) using a Student-t kernel in low dimensions.

\[\; C=\sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}} \;\]

Imports and Settings

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import animation
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

np.seterr(divide="ignore", invalid="ignore")
warnings.simplefilter("ignore", RuntimeWarning)
warnings.filterwarnings("ignore")

Introduction

I have found that one of the best ways to truly understanding any statistical algorithm or methodology is to manually implement it yourself. On the flip side, coding these algorithms can sometimes be time consuming and a real pain, and when somebody else has already done it, why would I want to spend my time doing it — seems inefficient, no? Both are fair points, and I am not here to make an argument for one over the other.

This article is designed for readers who are interested in understanding t-SNE via translation of the mathematics in the original paper — by Laurens van der Maaten & Geoffrey Hinton — into python code implementation.¹ I find these sort of exercises to be quite illuminating into the inner workings of statistical algorithms/models and really test your underlying understanding and assumptions regarding these algorithms/models. You are almost guaranteed to walk away with a better understanding then you had before. At the very minimum, successful implementation is always very satisfying!

This article will be accessible to individuals with any level of exposure of t-SNE. However, note a few things this post definitely is not:

A strictly conceptual introduction and exploration of t-SNE, as there are plenty of other great resources out there that do this; nevertheless, I will be doing my best to connect the mathematical equations to their intuitive/conceptual counterparts at each stage of implementation.
A comprehensive discussion into the applications & pros/cons of t-SNE, as well as direct comparisons of t-SNE to other dimensionality reduction techniques. I will, however, briefly touch on these topics throughout this article, but will by no means cover this in-depth.

Without further ado, let’s start with a brief introduction into t-SNE.

A Brief Introduction to t-SNE

t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction tool that is primarily used in datasets with a large dimensional feature space and enables one to visualize the data down, or project it, into a lower dimensional space (usually 2-D). It is especially useful for visualizing non-linearly separable data wherein linear methods such as Principal Component Analysis (PCA) would fail. Generalizing linear frameworks of dimensionality reduction (such as PCA) into non-linear approaches (such as t-SNE) is also known as Manifold Learning. These methods can be extremely useful for visualizing and understanding the underlying structure of a high dimensional, non-linear data set, and can be great for disentangling and grouping together observations that are similar in the high-dimensional space. For more information on t-SNE and other manifold learning techniques, the scikit-learn documentation is a great resource. Additionally, to read about some cool areas t-SNE has seen applications, the Wikipedia page highlights some of these areas with references to the work.

Let’s start with breaking down the name t-distributed stochastic neighbor embedding into its components. t-SNE is an extension on stochastic neighbor embedding (SNE) presented 6 years earlier in this paper by Geoffrey Hinton & Sam Roweis. So let’s start there. The stochastic part of the name comes from the fact that the objective function is not convex and thus different results can arise from different initializations. The neighbor embedding highlights the nature of the algorithm — optimally mapping the points in the original high-dimensional space into the corresponding low-dimensional space while best preserving the “neighborhood” structure of the points. SNE is comprised of the following (simplified) steps:

Obtain the Similarity Matrix between Points in the Original Space: Compute the conditional probabilities for each datapoint \(j\) relative to each datapoint \(i\). These conditional probabilities are calculated in the original high-dimensional space using a Gaussian centered at \(i\) and take on the following interpretation: the probability that i would pick \(j\) as its neighbor in the original space. This creates a matrix that represents similarities between the points.
Initialization: Choose random starting points in the lower-dimensional space (say, 2-D) for each datapoint in the original space and compute new conditional probabilities similarly as above in this new space.
Mapping: Iteratively improve upon the points in the lower-dimensional space until the Kullback-Leibler divergences between all the conditional probabilities is minimized. Essentially we are minimizing the differences in the probabilities between the similarity matrices of the two spaces so as to ensure the similarities are best preserved in the mapping of the original high-dimensional dataset to the low-dimensional dataset.

t-SNE improves upon SNE in two primary ways:

It minimizes the Kullback-Leibler divergences between the joint probabilities rather than the conditional probabilities. The authors refer to this as “symmetric SNE” b/c their approach ensures that the joint probabilities \(p_ij\) = \(p_ji\). This results in a much better behaved cost function that is easier to optimize.
It computes the similarities between points using a student’s t-distribution w/ one degree of freedom (also a Cauchy Distribution) rather than a Gaussian in the low-dimensional space (step 2 above). Here we can see where the “t” in t-SNE is coming from. This improvement helps to alleviate the “crowding problem” highlighted by the authors and to further improve the optimization problem. This “crowding problem” can be envisioned as such: Imagine we have a 10-D space, the amount of space available in 2-D will not be sufficient to accurately capture those moderately dissimilar points compared to the amount of space for nearby points relative to the amount of space available in a 10-D space. More simply, just envision taking a 3-D space and projecting it down to 2-D, the 3-D space will have much more overall space to model the similarities relative to the projection down into 2-D. The Student-t distribution helps alleviate this problem by having heavier tails than the normal distribution. See the original paper for a much more in-depth treatment of this problem.

If this did not all track immediately, that is okay! I am hoping when we implement this in code, the pieces will all fall in to place. The main takeaway is this: t-SNE models similarities between datapoints in the high-dimensional space using joint probabilities of “datapoints choosing others as its neighbor”, and then tries to find the best mapping of these points down into the low-dimensional space, while best preserving the original high-dimensional similarities.

Implementation from Scratch

Let’s now move on to understanding t-SNE via implementing the original version of the algorithm as presented in the paper by Laurens van der Maaten & Geoffrey Hinton. We will first start with implementing algorithm 1 below step-by-step, which will cover 95% of the main algorithm. There are two additional enhancements the authors note: 1) Early Exaggeration & 2) Adaptive Learning Rates. We will only discuss adding in the early exaggeration as that is most conducive in aiding the interpretation of the actual algorithms inner workings, as the adaptive learning rate is focused on improving the rates of convergence.

1. Inputs

Following the original paper, we will be using the publicly available MNIST dataset from OpenML with images of handwritten digits from 0–9.² We will also randomly sample 1000 images from the dataset & reduce the dimensionality of the dataset using Principal Component Analysis (PCA) and keep 30 components. These are both to improve computational time of the algorithm, as the code here is not optimized for speed, but rather for interpretability & learning.

Code

# Fetch MNIST data
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

X_total = pd.DataFrame(mnist["data"])
y_total = pd.DataFrame(mnist["target"])

X_reduced = X_total.sample(n=1000)
y_reduced = y_total.loc[X_reduced.index]

# PCA to keep 30 components
X = PCA(n_components=30).fit_transform(X_reduced)

This will be our X dataset with each row being an image and each column being a feature, or principal component in this case (i.e. linear combinations of the original pixels):

Code

pd.DataFrame(X)

	0	1	2	3	4	5	6	7	8	9	...	20	21	22	23	24	25	26	27	28	29
0	-931.915201	411.259681	-66.223330	-22.101824	-522.071794	-204.238021	79.810311	-151.501689	-2.645304	-71.409799	...	48.977339	64.932067	82.663785	81.841769	113.573963	-289.648157	-37.536887	44.385295	-33.734615	116.867908
1	-498.965271	257.802505	-448.205554	-385.294813	349.350506	-672.517543	-635.617347	127.445850	91.703337	78.129701	...	14.600850	104.047832	-61.352867	79.213565	40.009029	-305.744439	-452.102839	-0.012526	129.821085	-51.821047
2	-653.828524	180.805708	-37.946817	-178.391590	-753.344540	87.362490	-273.592196	141.444895	-237.866486	131.811097	...	-48.118871	-13.359654	198.588208	-37.209171	9.447840	71.514532	-25.251364	14.821461	-139.170878	-80.313444
3	27.990477	392.334507	477.493165	40.194836	505.795588	136.666229	58.409040	-101.268468	-179.489579	-194.865054	...	-243.163467	-175.022793	259.175164	471.665036	173.720381	-21.926315	40.559378	121.260797	-103.262280	-174.272627
4	108.686834	-443.761169	1047.275507	-269.081714	698.282492	268.993661	2.649905	227.015596	590.552236	64.509176	...	34.180812	-14.272677	-4.267202	-380.389612	233.763787	-196.247229	-21.257888	-144.816963	-130.726495	105.042093
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	704.001866	416.201728	728.202071	545.738046	-13.182785	168.897764	-252.975670	-3.950580	-212.479052	-567.985417	...	327.495047	44.757261	11.850213	-324.565843	-103.983645	-79.764025	486.828049	-174.229356	-399.590480	-207.753263
996	-435.583186	-518.925730	523.326068	273.215373	-281.536846	368.020921	-30.906676	-86.819580	130.227015	227.218531	...	4.785787	-84.410557	-304.009573	13.309568	-254.705140	-115.894610	96.160201	97.570096	-40.360102	196.480744
997	290.048713	-123.651858	38.271626	-788.843811	-219.241666	547.644998	-41.129062	-400.736268	-440.175055	-339.627691	...	70.933713	415.261998	-277.123724	68.509559	79.043054	205.682516	191.826804	209.089092	-292.390174	-24.164274
998	-81.494684	-893.865631	284.655251	-193.626306	208.005578	108.519770	-138.118422	-44.223366	-223.193259	345.672341	...	297.352410	89.948871	-19.759266	-122.929192	-1.523811	103.019254	-240.920921	117.842351	-171.811250	-268.943841
999	-237.792154	-449.364853	993.977356	-20.240607	-83.963739	197.301905	198.658445	404.901898	526.368305	368.054294	...	-76.226814	-66.050405	-62.809371	43.364113	-449.005920	-266.624174	211.255202	74.683071	147.336721	343.400059

1000 rows × 30 columns

We also will need to specify the cost function parameters — perplexity — and the optimization parameters — iterations, learning rate, & momentum. We will hold off on these for now and address them as they appear at each stage.

In terms of output, recall that we seek a the low-dimensional mapping of the original dataset X. We will be mapping the original space into a 2 dimensional space throughout this example. Thus, our new output will be the 1000 images now represented in a 2 dimensional space rather than the original 30 dimensional space, with a shape of [1000, 2].

2. Compute Affinities/Similarities of X in the Original Space

Now that we have our inputs, the first step is to compute the pairwise similarities in the original high dimensional space. That is, for each image \(i\) we compute the probability that \(i\) would pick image \(j\) as its neighbor in the original space for each \(j\). These probabilities are calculated via a normal distribution centered around each point and then are normalized to sum up to 1. Mathematically, we have:

\[ \begin{equation} p_{j|i}=\frac{\exp{(-\|x_i-x_j\|^2/2\sigma_i^2)}}{\sum_{k\ne i}\exp{(-\|x_i-x_j\|^2/2\sigma_i^2)}} \tag{1} \end{equation} \]

Note that, in our case with n = 1000, these computations will result in a 1000 x 1000 matrix of similarity scores. Note we set \(p\) = 0 whenever \(i\) = \(j\) b/c we are modeling pairwise similarities. However, you may notice that we have not mentioned how \(\sigma\) is determined. This value is determined for each observation \(i\) via a grid search based on the user-specified desired perplexity of the distributions. We will talk about this immediately below, but let’s first look at how we would code eq. (1) above:

Code

def get_original_pairwise_affinities(
    X: np.ndarray, perplexity: int = 10
) -> np.ndarray:
    """
    Function to obtain affinities matrix.

    Parameters
    ----------
    X
        The input data array.
    perplexity
        The perplexity value for the grid search.

    Returns
    ------
    np.ndarray
        The pairwise affinities matrix.
    """

    n = len(X)

    print("Computing Pairwise Affinities....")

    p_ij = np.zeros(shape=(n, n))
    for i in range(0, n):
        # Equation 1 numerator
        diff = X[i] - X
        σ_i = grid_search(diff, i, perplexity)  # Grid Search for σ_i
        norm = np.linalg.norm(diff, axis=1)
        p_ij[i, :] = np.exp(-(norm**2) / (2 * σ_i**2))

        # Set p = 0 when j = i
        np.fill_diagonal(p_ij, 0)

        # Equation 1
        p_ij[i, :] = p_ij[i, :] / np.sum(p_ij[i, :])

    # Set 0 values to minimum numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    p_ij = np.maximum(p_ij, ε)

    print("Completed Pairwise Affinities Matrix. \n")

    return p_ij

Now before we look at the results of this code, let’s discuss how we determine the values of \(\sigma\) via the grid_search() function. Given a specified value of perplexity (which in this context can be loosely thought about as the number of nearest neighbors for each point), we do a grid search over a range of values of \(\sigma\) such that the following equation is as close to equality as possible for each \(i\):

\[ \begin{equation} Perp(P_i)=2^{H(P_i)} \tag{2} \end{equation} \]

where \(H(P_i)\) is the Shannon entropy of \(P\):

\[ \begin{equation} H(P_i) = - \sum_jp_{j|i}\log_2p_{j|i} \tag{2} \end{equation} \]

In our case, we will set perplexity = 10 and set the search space to be defined by [0.01 * standard deviation of the norms for the difference between images \(i\) and \(j\), 5 * standard deviation of the norms for the difference between images \(i\) and \(j\)] divided into 200 equal steps. Knowing this, we can define our grid_search() function as follows:

Code

def grid_search(diff_i: np.ndarray, i: int, perplexity: int) -> float:
    """
    Helper function to obtain σ's based on user-specified perplexity.

    Parameters
    -----------
    diff_i
        Array containing the pairwise differences between data points.
    i
        Index of the current data point.
    perplexity
        User-specified perplexity value.

    Returns
    -------
    float
        The value of σ that satisfies the perplexity condition.
    """

    result = np.inf  # Set first result to be infinity

    norm = np.linalg.norm(diff_i, axis=1)
    std_norm = np.std(
        norm
    )  # Use standard deviation of norms to define search space

    for σ_search in np.linspace(0.01 * std_norm, 5 * std_norm, 200):
        # Equation 1 Numerator
        p = np.exp(-(norm**2) / (2 * σ_search**2))

        # Set p = 0 when i = j
        p[i] = 0

        # Equation 1 (ε -> 0)
        ε = np.nextafter(0, 1)
        p_new = np.maximum(p / np.sum(p), ε)

        # Shannon Entropy
        H = -np.sum(p_new * np.log2(p_new))

        # Get log(perplexity equation) as close to equality
        if np.abs(np.log(perplexity) - H * np.log(2)) < np.abs(result):
            result = np.log(perplexity) - H * np.log(2)
            σ = σ_search

    return σ

Given these functions, we can compute the affinity matrix via:

Code

p_ij = get_original_pairwise_affinities(X)

Computing Pairwise Affinities....
Completed Pairwise Affinities Matrix.

Code

pd.DataFrame(p_ij)

	0	1	2	3	4	5	6	7	8	9	...	990	991	992	993	994	995	996	997	998	999
0	4.940656e-324	3.553563e-25	4.537423e-09	6.174442e-25	4.241161e-39	1.868486e-13	5.931357e-12	1.012538e-48	2.593082e-20	3.568851e-31	...	2.456403e-30	6.269366e-36	4.431148e-35	2.316721e-28	2.961418e-22	1.280241e-41	5.192504e-23	3.703750e-31	6.202265e-31	3.336875e-33
1	3.885230e-05	4.940656e-324	1.142135e-04	6.781833e-06	1.321032e-08	6.115228e-04	8.990179e-07	1.069983e-10	1.537003e-06	9.266514e-07	...	2.454683e-06	9.609287e-08	9.919418e-09	4.872401e-09	3.240739e-05	1.456905e-09	8.230670e-07	2.699673e-07	4.660745e-06	2.299918e-08
2	2.184044e-04	3.651837e-13	4.940656e-324	1.762836e-14	9.519416e-25	6.545152e-05	2.958753e-11	2.839515e-27	4.390662e-11	6.153744e-15	...	3.428155e-17	2.040847e-18	5.982949e-18	2.829969e-18	3.519467e-09	6.995346e-23	2.919784e-10	2.290682e-13	1.080870e-13	5.873674e-17
3	1.565194e-09	9.061402e-12	2.659864e-09	4.940656e-324	1.487900e-10	2.419839e-10	3.162685e-09	5.968172e-19	9.871110e-08	1.591932e-12	...	1.093071e-11	2.111759e-09	6.608259e-19	1.369975e-04	1.064901e-08	1.373909e-10	6.861039e-11	1.425653e-10	2.098048e-10	2.096172e-12
4	2.795045e-09	5.231274e-11	1.470112e-09	5.848820e-06	4.940656e-324	1.696103e-10	4.795533e-07	1.559432e-13	1.606426e-08	2.920207e-11	...	2.483382e-08	1.437875e-10	4.482341e-13	1.131022e-04	6.983303e-03	8.733646e-10	9.160671e-06	2.484039e-09	3.769483e-05	1.087933e-04
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	4.304576e-10	7.012836e-13	8.647610e-09	9.318705e-06	6.919233e-10	4.351412e-10	1.073513e-10	6.971529e-13	1.656619e-10	5.879043e-09	...	4.372448e-07	3.663558e-08	2.204474e-09	1.585569e-05	5.331134e-09	4.940656e-324	6.300859e-08	4.980083e-07	6.978539e-09	5.869801e-10
996	3.002293e-09	1.225874e-14	5.422104e-07	2.129593e-11	9.211913e-11	1.823652e-10	8.056597e-08	8.890065e-25	8.447319e-09	6.294541e-14	...	8.962824e-15	1.090101e-13	6.806936e-14	5.332980e-13	1.067914e-04	2.140688e-14	4.940656e-324	4.340638e-11	1.281880e-06	2.281904e-04
997	2.050591e-07	5.158642e-09	2.258650e-05	1.988845e-06	2.167747e-09	9.226990e-08	7.558719e-09	1.129690e-11	3.781045e-06	2.009996e-07	...	3.680358e-09	2.579330e-08	1.330346e-09	5.860590e-07	1.940207e-06	1.807296e-07	2.014530e-06	4.940656e-324	2.774594e-05	2.205589e-09
998	2.798479e-10	3.928866e-10	1.943395e-07	9.892538e-09	9.252175e-08	2.825299e-10	4.490271e-09	1.254390e-17	5.414178e-09	1.456098e-08	...	7.369063e-12	1.111492e-13	3.058384e-13	1.796597e-10	2.207307e-04	9.734174e-13	3.392389e-05	4.061103e-07	4.940656e-324	1.326595e-10
999	8.373567e-07	7.092628e-09	7.586651e-06	2.315556e-06	9.827672e-05	2.095126e-07	1.999695e-06	1.645698e-11	5.052185e-07	9.263120e-09	...	1.687096e-07	1.664473e-08	2.254217e-09	1.751725e-06	3.150680e-04	2.446679e-08	9.187911e-03	5.721215e-08	1.553391e-06	4.940656e-324

1000 rows × 1000 columns

Note, the diagonal elements are set to \(\epsilon \approx 0\) by construction (whenever \(i\) = \(j\)). Recall that a key extension of the t-SNE algorithm is to compute the joint probabilities rather than the conditional probabilities. This is computed simply as follow:

\[ \begin{equation} p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n} \tag{3} \end{equation} \]

Thus, we can define a new function

Code

def get_symmetric_p_ij(p_ij: np.ndarray) -> np.ndarray:
    """
    Function to obtain symmetric affinities matrix utilized in t-SNE.

    Parameters
    ----------
    p_ij
        The input affinity matrix.

    Returns
    np.ndarray
        The symmetric affinities matrix.
    """
    print("Computing Symmetric p_ij matrix....")

    n = len(p_ij)
    p_ij_symmetric = np.zeros(shape=(n, n))
    for i in range(0, n):
        for j in range(0, n):
            p_ij_symmetric[i, j] = (p_ij[i, j] + p_ij[j, i]) / (2 * n)

    # Set 0 values to minimum numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    p_ij_symmetric = np.maximum(p_ij_symmetric, ε)

    print("Completed Symmetric p_ij Matrix. \n")

    return p_ij_symmetric

Feeding in p_ij from above, we obtain the following symmetric affinities matrix:

Code

p_ij_symmetric = get_symmetric_p_ij(p_ij)

Computing Symmetric p_ij matrix....
Completed Symmetric p_ij Matrix.

Code

pd.DataFrame(p_ij_symmetric)

	0	1	2	3	4	5	6	7	8	9	...	990	991	992	993	994	995	996	997	998	999
0	4.940656e-324	1.942615e-08	1.092044e-07	7.825968e-13	1.397522e-12	3.202544e-07	3.319254e-09	7.232154e-10	5.912796e-10	1.923277e-12	...	6.540083e-09	6.621684e-10	5.148447e-10	1.438509e-09	7.669477e-13	2.152288e-13	1.501146e-12	1.025295e-10	1.399239e-13	4.186783e-10
1	1.942615e-08	4.940656e-324	5.710676e-08	3.390921e-09	6.631316e-12	3.220521e-07	4.495090e-10	5.534861e-11	7.685366e-10	4.634043e-10	...	4.008354e-09	1.425444e-10	1.042633e-11	2.446104e-12	1.620372e-08	7.288033e-13	4.115335e-10	1.375630e-10	2.330569e-09	1.504590e-11
2	1.092044e-07	5.710676e-08	4.940656e-324	1.329941e-12	7.350562e-13	3.063667e-06	1.259928e-12	2.899428e-09	1.280507e-09	5.141157e-10	...	1.485147e-08	7.380561e-09	4.658381e-09	4.578475e-10	3.279662e-10	4.323805e-12	2.712512e-10	1.129325e-08	9.716982e-11	3.793326e-09
3	7.825968e-13	3.390921e-09	1.329941e-12	4.940656e-324	2.924484e-09	8.389021e-11	1.590171e-12	2.463520e-09	3.553312e-10	1.952497e-12	...	1.006520e-08	9.480590e-08	1.084791e-11	6.106704e-06	5.613025e-12	4.659421e-09	4.495316e-14	9.944938e-10	5.051172e-12	1.157779e-09
4	1.397522e-12	6.631316e-12	7.350562e-13	2.924484e-09	4.940656e-324	8.693272e-14	2.397767e-10	1.333054e-10	8.035810e-12	1.460260e-14	...	5.487371e-10	2.097159e-11	3.341657e-13	1.108796e-07	3.492181e-06	7.826440e-13	4.580382e-09	2.325893e-12	1.889367e-08	1.035350e-07
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	2.152288e-13	7.288033e-13	4.323805e-12	4.659421e-09	7.826440e-13	2.245980e-13	5.367567e-14	3.781865e-10	8.285628e-14	2.943525e-12	...	3.877889e-09	9.434798e-10	9.315206e-11	1.278966e-08	2.665567e-12	4.940656e-324	3.150431e-11	3.393689e-10	3.489756e-12	1.252688e-11
996	1.501146e-12	4.115335e-10	2.712512e-10	4.495316e-14	4.580382e-09	1.494069e-10	4.118153e-11	5.884031e-11	1.693851e-10	2.555149e-13	...	6.597732e-10	2.604844e-09	1.085590e-09	2.406863e-10	6.444750e-08	3.150431e-11	4.940656e-324	1.007287e-09	1.760288e-08	4.708051e-06
997	1.025295e-10	1.375630e-10	1.129325e-08	9.944938e-10	2.325893e-12	4.813365e-11	3.779360e-12	7.329960e-10	1.899246e-09	1.023884e-10	...	1.541718e-10	1.073140e-09	6.393322e-11	1.114383e-09	9.701086e-10	3.393689e-10	1.007287e-09	4.940656e-324	1.407602e-08	2.970887e-11
998	1.399239e-13	2.330569e-09	9.716982e-11	5.051172e-12	1.889367e-08	5.801515e-12	2.245399e-12	3.957500e-10	5.536396e-12	1.852097e-10	...	6.652160e-10	1.043128e-10	1.047833e-10	2.189754e-10	1.117092e-07	3.489756e-12	1.760288e-08	1.407602e-08	4.940656e-324	7.767619e-10
999	4.186783e-10	1.504590e-11	3.793326e-09	1.157779e-09	1.035350e-07	1.049479e-10	9.998477e-10	5.366077e-11	2.526257e-10	4.631583e-12	...	3.960052e-10	5.317912e-11	5.649993e-12	1.083718e-09	1.575433e-07	1.252688e-11	4.708051e-06	2.970887e-11	7.767619e-10	4.940656e-324

1000 rows × 1000 columns

Now we have completed the first main step in t-SNE! We computed the symmetric affinity matrix in the original high-dimensional space. Before we dive right into the optimization stage, we will discuss the main components of the optimization problem in the next two steps and then combine them into our for loop.

3. Sample Initial Solution & Compute Low Dimensional Affinity Matrix

Now we want to sample a random initial solution in the lower dimensional space as follows:

Code

def initialization(
    X: np.ndarray, n_dimensions: int = 2, initialization: str = "random"
) -> np.ndarray:
    """
    Obtain initial solution for t-SNE either randomly or using PCA.

    Parameters
    ----------
    X
        The input data array.
    n_dimensions
        The number of dimensions for the output solution. Default is 2.
    initialization
        The initialization method. Can be 'random' or 'PCA'. Default is 'random'.

    Returns
    -------
    np.ndarray
        The initial solution for t-SNE.

    Raises
    -------
    ValueError
        If the initialization method is neither 'random' nor 'PCA'.
    """

    # Sample Initial Solution
    if initialization == "random" or initialization != "PCA":
        y0 = np.random.normal(loc=0, scale=1e-4, size=(len(X), n_dimensions))
    elif initialization == "PCA":
        X_centered = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_centered)
        y0 = X_centered @ Vt.T[:, :n_dimensions]
    else:
        raise ValueError("Initialization must be 'random' or 'PCA'")

    return y0

Code

y0 = initialization(X)

Code

pd.DataFrame(y0)

	0	1
0	-0.000006	-0.000083
1	-0.000061	-0.000056
2	0.000125	-0.000092
3	-0.000007	0.000005
4	0.000079	-0.000019
...	...	...
995	0.000128	0.000066
996	0.000113	-0.000063
997	-0.000060	0.000040
998	0.000073	-0.000030
999	0.000195	-0.000062

1000 rows × 2 columns

Now, we want to compute the affinity matrix in this lower dimensional space. However, recall that we do this utilizing a student’s t-distribution w/ 1 degree of freedom:

\[ \begin{equation} q_{ij}=\frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k \ne l} (1+\|y_i-y_j\|^2)^{-1}} \tag{4} \end{equation} \]

Again, we set \(q=0\) whenever \(i = j\). Note this equation differs from eq. (1) in that the denominator is over all \(i\) and thus symmetric by construction. Putting this into code, we obtain:

Code

def get_low_dimensional_affinities(Y: np.ndarray) -> np.ndarray:
    """
    Obtain low-dimensional affinities.

    Parameters
    -----------
    Y
        The low-dimensional representation of the data points.

    Returns
    -------
    np.ndarray
        The low-dimensional affinities matrix.
    """

    n = len(Y)
    q_ij = np.zeros(shape=(n, n))

    for i in range(0, n):
        # Equation 4 Numerator
        diff = Y[i] - Y
        norm = np.linalg.norm(diff, axis=1)
        q_ij[i, :] = (1 + norm**2) ** (-1)

    # Set p = 0 when j = i
    np.fill_diagonal(q_ij, 0)

    # Equation 4
    q_ij = q_ij / q_ij.sum()

    # Set 0 values to minimum numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    q_ij = np.maximum(q_ij, ε)

    return q_ij

Here we are seeking a 1000 x 1000 affinity matrix but now in the lower dimensional space:

Code

q_ij = get_low_dimensional_affinities(y0)

Code

pd.DataFrame(q_ij)

	0	1	2	3	4	5	6	7	8	9	...	990	991	992	993	994	995	996	997	998	999
0	4.940656e-324	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06
1	1.001001e-06	4.940656e-324	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06
2	1.001001e-06	1.001001e-06	4.940656e-324	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06
3	1.001001e-06	1.001001e-06	1.001001e-06	4.940656e-324	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06
4	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	4.940656e-324	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	4.940656e-324	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06
996	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	4.940656e-324	1.001001e-06	1.001001e-06	1.001001e-06
997	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	4.940656e-324	1.001001e-06	1.001001e-06
998	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	4.940656e-324	1.001001e-06
999	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	0.000001	0.000001	0.000001	0.000001	0.000001	...	0.000001	0.000001	0.000001	0.000001	0.000001	1.001001e-06	1.001001e-06	1.001001e-06	1.001001e-06	4.940656e-324

1000 rows × 1000 columns

4. Compute Gradient of the Cost Function

Recall, our cost function is the Kullback-Leibler divergence between joint probability distributions in the high dimensional space and low dimensional space:

\[ \begin{equation} C=\text{KL}(P\|Q)=\sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{5} \end{equation} \]

Intuitively, we want to minimize the difference in the affinity matrices \(p_{ij}\) and \(q_{ij}\) thereby best preserving the “neighborhood” structure of the original space. The optimization problem is solved using gradient descent, but first let’s look at computing the gradient for the cost function above. The authors derive (see appendix A of the paper) the gradient of the cost function as follows:

\[ \begin{equation} \frac{\partial C}{\partial y_i} = 4 \sum_j(p_{ij}-q_{ij})(1+\|y_i-y_j\|^2)^{-1}(y_i-y_j) \tag{6} \end{equation} \]

In python, we have:

Code

def get_gradient(p_ij: np.ndarray, q_ij: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """
    Obtain gradient of cost function at current point Y.

    Parameters
    ----------
    p_ij
        The joint probability distribution matrix.
    q_ij
        The Student's t-distribution matrix.
    Y
        The current point in the low-dimensional space.

    Returns
    ------
    np.ndarray
        The gradient of the cost function at the current point Y.
    """

    n = len(p_ij)

    # Compute gradient
    gradient = np.zeros(shape=(n, Y.shape[1]))
    for i in range(0, n):
        # Equation 5
        diff = Y[i] - Y
        A = np.array([(p_ij[i, :] - q_ij[i, :])])
        B = np.array([(1 + np.linalg.norm(diff, axis=1) ** 2) ** (-1)])
        C = diff
        gradient[i] = 4 * np.sum((A * B).T * C, axis=0)

    return gradient

Feeding in the relevant arguments, we obtain the gradient as follows:

Code

gradient = get_gradient(p_ij_symmetric, q_ij, y0)

Code

pd.DataFrame(gradient)

	0	1
0	-1.286946e-07	-9.462969e-09
1	1.707158e-09	-1.425793e-09
2	2.251103e-07	4.674707e-08
3	9.274299e-08	6.214602e-08
4	-2.121515e-07	1.341276e-07
...	...	...
995	-2.399576e-07	-2.276706e-07
996	-1.773476e-07	1.952968e-07
997	-2.348704e-07	-2.571400e-08
998	9.532368e-08	1.930476e-07
999	-2.434140e-07	1.972993e-08

1000 rows × 2 columns

Now, we have all the pieces in order to solve the optimization problem!

5. Iterate & Optimize the Low-Dimensional Mapping

In order to update our low-dimensional mapping, we use gradient descent with momentum as outlined by the authors:

\[ \begin{equation} \mathcal{Y}^{(t)}=\mathcal{Y}^{(t-1)}+\eta\frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)(\mathcal{Y}^{(t-1)}-\mathcal{Y}^{(t-2)}) \tag{7} \end{equation} \]

where \(\eta\) is our learning rate and \(\alpha(t)\) is our momentum term as a function of time. The learning rate controls the step size at each iteration and the momentum term allows the optimization algorithm to gain inertia in the smooth direction of the search space, while not being bogged down by the noisy parts of the gradient. We will set \(\eta=200\) for our example and will fix \(\alpha(t)=0.5\) if \(t < 250\) and \(\alpha(t)=0.8\) otherwise. We have all the components necessary above to compute to the update rule, thus we can now run our optimization over a set number of iterations \(T\) (we will set \(T=1000\)).

Before we set up for iteration scheme, let’s first introduce the enhancement the authors refer to as “early exaggeration”. This term is a constant that scales the original matrix of affinities \(p_{ij}\). What this does is it places more emphasis on modeling the very similar points (high values in \(p_{ij}\) from the original space) in the new space early on and thus forming “clusters” of highly similar points. The early exaggeration is placed on at the beginning of the iteration scheme (\(T<250\)) and then turned off otherwise. Early exaggeration will be set to 4 in our case. We will see this in action in our visual below following implementation.

Now, putting all of the pieces together for the algorithm, we have the following:

Code

def tsne(
    X: np.ndarray,
    perplexity: int = 10,
    T: int = 1000,
    η: int = 200,
    early_exaggeration: int = 4,
    n_dimensions: int = 2,
) -> list[np.ndarray, np.ndarray]:
    """
    t-SNE (t-Distributed Stochastic Neighbor Embedding) algorithm implementation.

    Parameters
    ----------
    X
        The input data matrix of shape (n_samples, n_features).
    perplexity
        The perplexity parameter. Default is 10.
    T
        The number of iterations for optimization. Default is 1000.
    η
        The learning rate for updating the low-dimensional embeddings. Default is 200.
    early_exaggeration
        The factor by which the pairwise affinities are exaggerated during the early iterations of optimization.
        Default is 4.
    n_dimensions
        The number of dimensions of the low-dimensional embeddings.
        Default is 2.

    Returns
    -------
    list[np.ndarray, np.ndarray]
        A list containing the final low-dimensional embeddings and the history of embeddings at each iteration.
    """
    n = len(X)

    # Get original affinities matrix
    p_ij = get_original_pairwise_affinities(X, perplexity)
    p_ij_symmetric = get_symmetric_p_ij(p_ij)

    # Initialization
    Y = np.zeros(shape=(T, n, n_dimensions))
    Y_minus1 = np.zeros(shape=(n, n_dimensions))
    Y[0] = Y_minus1
    Y1 = initialization(X, n_dimensions)
    Y[1] = np.array(Y1)

    print("Optimizing Low Dimensional Embedding....")
    # Optimization
    for t in range(1, T - 1):
        # Momentum & Early Exaggeration
        if t < 250:
            α = 0.5
            early_exaggeration = early_exaggeration
        else:
            α = 0.8
            early_exaggeration = 1

        # Get Low Dimensional Affinities
        q_ij = get_low_dimensional_affinities(Y[t])

        # Get Gradient of Cost Function
        gradient = get_gradient(early_exaggeration * p_ij_symmetric, q_ij, Y[t])

        # Update Rule
        Y[t + 1] = (
            Y[t] - η * gradient + α * (Y[t] - Y[t - 1])
        )  # Use negative gradient

        # Compute current value of cost function
        if t % 50 == 0 or t == 1:
            cost = np.sum(p_ij_symmetric * np.log(p_ij_symmetric / q_ij))
            print(f"Iteration {t}: Value of Cost Function is {cost}")

    print(
        f"Completed Low Dimensional Embedding: Final Value of Cost Function is {np.sum(p_ij_symmetric * np.log(p_ij_symmetric / q_ij))}"
    )
    solution = Y[-1]

    return solution, Y

Now calling the code:

Code

solution, Y = tsne(X)

Computing Pairwise Affinities....
Completed Pairwise Affinities Matrix. 

Computing Symmetric p_ij matrix....
Completed Symmetric p_ij Matrix. 

Optimizing Low Dimensional Embedding....
Iteration 1: Value of Cost Function is 4.4406978130558805
Iteration 50: Value of Cost Function is 2.962401546326715
Iteration 100: Value of Cost Function is 2.700973774872353
Iteration 150: Value of Cost Function is 2.626831291889962
Iteration 200: Value of Cost Function is 2.5857498685122136
Iteration 250: Value of Cost Function is 2.564941209380378
Iteration 300: Value of Cost Function is 1.5533671265166362
Iteration 350: Value of Cost Function is 1.3730253404956603
Iteration 400: Value of Cost Function is 1.2803489722973787
Iteration 450: Value of Cost Function is 1.2212705755480584
Iteration 500: Value of Cost Function is 1.1794021107310177
Iteration 550: Value of Cost Function is 1.147752347037585
Iteration 600: Value of Cost Function is 1.12276586088434
Iteration 650: Value of Cost Function is 1.1024151891121186
Iteration 700: Value of Cost Function is 1.0854616699236224
Iteration 750: Value of Cost Function is 1.071036433368177
Iteration 800: Value of Cost Function is 1.058604345842621
Iteration 850: Value of Cost Function is 1.0477978673444914
Iteration 900: Value of Cost Function is 1.038290429912608
Iteration 950: Value of Cost Function is 1.029831423205263
Completed Low Dimensional Embedding: Final Value of Cost Function is 1.0225388914762061

Code

pd.DataFrame(solution)

	0	1
0	34.014055	1.542836
1	35.155558	-16.882073
2	13.360120	-2.866010
3	-8.949272	-14.304973
4	-16.397379	23.816197
...	...	...
995	-8.576599	-26.687632
996	-3.064378	32.732870
997	-19.259571	7.352250
998	-9.212013	19.369818
999	-4.722223	35.397239

1000 rows × 2 columns

where solution is the final 2-D mapping and Y is our mapped 2-D values at each step of the iteration. Plotting the evolution of Y where Y[-1] is our final 2-D mapping, we obtain (note how the algorithm behaves with early exaggeration on and off):

Code

fig, ax = plt.subplots()
ax.axis("off")
ax.set_title("MNIST t-SNE")
scat = ax.scatter(Y[1][:, 0], Y[1][:, 1], c=y_reduced, cmap="tab10")
plt.colorbar(scat, ax=ax)

# t-SNE Descent Animation
ys = []
prelims = list(range(0, 50, 5))
early_range = list(range(50, 250, 10))
mid_range_1 = list(range(250, 300, 5))
mid_range_2 = list(range(300, 400, 10))
end_range = list(range(400, 1000, 50))

visual_range = (
    prelims
    + early_range
    + mid_range_1
    + mid_range_2
    + end_range
    + [999, 999, 999, 999, 999, 999, 999]
)

for i in visual_range:
    ys.append(Y[i])

def strike(text):
    result = ""
    for c in text:
        result = result + c + "\u0336"
    return result

def animate(iterations):
    scat.set_offsets(ys[iterations])
    if iterations < 31:
        ax.text(
            0.05,
            1,
            "Early Exaggeration",
            horizontalalignment="center",
            verticalalignment="center",
            transform=ax.transAxes,
        )
    else:
        ax.text(
            0.05,
            1,
            strike("                  "),
            horizontalalignment="center",
            verticalalignment="center",
            transform=ax.transAxes,
        )

    ax.set_xlim(
        [
            1.25 * np.min(ys[iterations][:, 0]),
            1.25 * np.max(ys[iterations][:, 0]),
        ]
    )
    ax.set_ylim(
        [
            1.25 * np.min(ys[iterations][:, 1]),
            1.25 * np.max(ys[iterations][:, 1]),
        ]
    )

rot_animation = animation.FuncAnimation(
    fig, animate, frames=len(ys) - 1, interval=350, blit=False
)

rot_animation.save("data/MNIST.gif", dpi=300)

I recommend playing around with different values of the parameters (i.e., perplexity, learning rate, early exaggeration, etc.) to see how the solution differs (See the original paper and the scikit-learn documentation for guides on using these parameters).

Conclusion

And there you have it, we have coded t-SNE from scratch! I hope you have found this exercise to be illuminating into the inner workings of t-SNE and, at the very minimum, satisfying. Note that this implementation is not intended to be optimized for speed, but rather for understanding. Additions to the t-SNE algorithm have been implemented to improve computational speed and performance, such as variants of the Barnes-Hut algorithm (tree-based approaches), using PCA as the initialization of the embedding, or using additional gradient descent extensions such as adaptive learning rates. The implementation in scikit-learn makes use of many of these enhancements.

As always, I hope you have enjoyed reading this as much as I enjoyed writing it.

❖❖❖

Access all the code via this GitHub Repo.

I appreciate you reading my post! My posts primarily explore real-world and theoretical applications of econometric and statistical/machine learning techniques, but also whatever I am currently interested in or learning 😁. At the end of the day, I write to learn! I hope to make complex topics slightly more accessible to all.

Footnotes

van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579–2605, 2008.↩︎
LeCun et al. (1999): The MNIST Dataset Of Handwritten Digits (Images) License: CC BY-SA 3.0↩︎