1. Definition
  2. t-SNE in Python
    1. Important Parameters

Definition

t-SNE stands for t-distributed stochastic neighbor embedding.

At its core, t-SNE takes a high-dimensional dataset, reduces it to a lower dimension, and uses stochastic neighbor embedding to project the data into clusters in that lower dimension. In other words, it converts the distance between each point and every other point into a probability that measures their similarity. An important detail is that the low-dimensional similarities are modelled with a Student's t-distribution rather than a normal distribution: the t-distribution's heavier tails mean that moderately distant points still receive meaningful similarity values, whereas under a normal distribution those values would collapse to near zero, which is useless for the projection.
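For reference, these are the similarity definitions from the original t-SNE paper (van der Maaten & Hinton, 2008): a Gaussian kernel measures similarity in the high-dimensional space, while a Student's t kernel with one degree of freedom (heavier tails) measures it in the low-dimensional embedding:

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}

Each bandwidth \sigma_i is tuned per point so that the distribution p_{j|i} has a fixed perplexity, which is exactly the perplexity parameter described below.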

t-SNE in Python

Required imports:

from fastai.vision.all import *
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

See the sklearn.manifold.TSNE documentation for the full list of parameters.

Important Parameters

  1. n_components This parameter selects the number of dimensions of the embedded space. In this case we set n_components = 2 because we want to project our vision model's features onto a 2D graph.

  2. Perplexity Perplexity is a very important parameter: it is a soft estimate of the number of neighbours each point accounts for when projecting to the new dimension, balancing attention between local and global structure. Typical values are between 5 and 50, and perplexity must be less than the total number of points.

  3. random_state This is the seed for the run, so that each time we run it we reproduce the previous result instead of getting a different random initialisation and a different embedding. This makes it easier to compare the effect of other parameter changes. A minimal sketch after this list shows all three parameters together.
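As a minimal standalone sketch showing all three parameters together (random toy data here, not the fastai features extracted below):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(100, 64)   # toy data: 100 points in 64 dimensions
tsne = TSNE(
    n_components=2,           # reduce to 2D for plotting
    perplexity=30,            # must be less than the number of points (100)
    random_state=42,          # fixed seed so reruns are reproducible
)
X_2d = tsne.fit_transform(X)  # shape (100, 2)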

Assuming you have already created the dls and run the learn and fine_tune steps, extract the model's activations (the feature vectors t-SNE will reduce):

dl = learn.dls.valid  # or use learn.dls.test_dl() for custom data
features = []
labels = []

# Put the model in eval mode, then forward pass to get activations
# (these are the final-layer outputs/logits)
learn.model.eval()
for xb, yb in dl:
    with torch.no_grad():
        output = learn.model(xb)
        features.append(output.cpu())
        labels.extend(yb.cpu().numpy())

# Stack the per-batch tensors into one (n_items, n_features) array
features = torch.cat(features).numpy()

# Map integer labels back to class names from the DataLoaders vocab
label_mapping = {i: c for i, c in enumerate(learn.dls.vocab)}
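Alternatively, fastai's get_preds does the same batched forward pass in one call. A sketch, with the caveat that get_preds applies the loss function's final activation by default (softmax for classification), so you get probabilities rather than raw logits:

# Same extraction via fastai's own API
preds, targs = learn.get_preds(dl=dl)  # preds are post-softmax here
features = preds.numpy()
labels = targs.numpy()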

Calling TSNE and adding the result to a DataFrame

perplexity = 30  # try a few values (e.g. 5, 30, 50) and compare the results
tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
tsne_result = tsne.fit_transform(features)  # shape (n_items, 2)

df = pd.DataFrame()
df['tsne-2d-one'] = tsne_result[:,0]
df['tsne-2d-two'] = tsne_result[:,1]
df['label'] = labels
df['label_name'] = [label_mapping[i] for i in labels]
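Optionally, a quick sanity check that the DataFrame lines up with the data:

print(df.shape)   # (n_items, 4): two t-SNE coordinates plus two label columns
print(df.head())  # first few embedded points with their class names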

Creating the plot

plt.figure(figsize=(10, 7))
sns.scatterplot(
    x='tsne-2d-one', y='tsne-2d-two',
    hue='label_name',
    palette=sns.color_palette("hsv", len(set(labels))),
    data=df,
    legend="full",
    alpha=0.7
)

plt.legend(
    title="Classes",
    bbox_to_anchor=(1.05, 1),
    loc='upper left',
    borderaxespad=0.
)

plt.title("t-SNE of Multiclass Fastai Learner, perplexity=" + str(perplexity))
plt.show()
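Because perplexity changes the layout so much, it is worth plotting several values side by side. A sketch that wraps the steps above in a hypothetical plot_tsne helper and loops over a few example perplexities:

def plot_tsne(features, labels, label_mapping, perplexity):
    # Run t-SNE at the given perplexity and scatter-plot the result
    result = TSNE(n_components=2, perplexity=perplexity,
                  random_state=42).fit_transform(features)
    df = pd.DataFrame({
        'tsne-2d-one': result[:, 0],
        'tsne-2d-two': result[:, 1],
        'label_name': [label_mapping[l] for l in labels],
    })
    plt.figure(figsize=(10, 7))
    sns.scatterplot(x='tsne-2d-one', y='tsne-2d-two', hue='label_name',
                    data=df, legend="full", alpha=0.7)
    plt.title(f"t-SNE of Multiclass Fastai Learner, perplexity={perplexity}")
    plt.show()

for p in [5, 30, 50]:  # example values; choose based on your dataset size
    plot_tsne(features, labels, label_mapping, p)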

Helpful YouTube videos: