# DIMENSIONALITY REDUCTION

Michael Staniek - February 21st, 2019


In this blog post we want to learn how to do dimensionality reduction on datasets.

This can be used to visualise word embeddings or other data with more than two or three dimensions.

## Dimensionality reduction for 2d and 3d visualisation

For this, two algorithms in particular, t-SNE and PCA, are easy to use because they are already implemented in sklearn.

First, a dataset has to be loaded; in this case, let's use a simple dataset from sklearn:

In [1]:
%matplotlib notebook

from sklearn.datasets import load_digits

# load the 8x8 handwritten-digit dataset
digits = load_digits()

print("Data:")
print(digits.data)
print("Maximum Value:")
print(digits.data.max())
print("Normalized Data:")
print(digits.data / digits.data.max())

# scale pixel values into [0, 1] and keep the first 500 samples
data = (digits.data / digits.data.max())[:500]
labels = digits.target[:500]

print(labels)

Data:
[[  0.   0.   5. ...,   0.   0.   0.]
[  0.   0.   0. ...,  10.   0.   0.]
[  0.   0.   0. ...,  16.   9.   0.]
...,
[  0.   0.   1. ...,   6.   0.   0.]
[  0.   0.   2. ...,  12.   0.   0.]
[  0.   0.  10. ...,  12.   1.   0.]]
Maximum Value:
16.0
Normalized Data:
[[ 0.      0.      0.3125 ...,  0.      0.      0.    ]
[ 0.      0.      0.     ...,  0.625   0.      0.    ]
[ 0.      0.      0.     ...,  1.      0.5625  0.    ]
...,
[ 0.      0.      0.0625 ...,  0.375   0.      0.    ]
[ 0.      0.      0.125  ...,  0.75    0.      0.    ]
[ 0.      0.      0.625  ...,  0.75    0.0625  0.    ]]
[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0
9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9
5 2 8 2 0 0 1 7 6 3 2 1 7 4 6 3 1 3 9 1 7 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4
4 7 2 8 2 2 5 7 9 5 4 8 8 4 9 0 8 9 8 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
8 9 0 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0 9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2
0 1 2 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8 2 0 0 1 7 6 3 2 1 7 3 1 3 9 1
7 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4 4 7 2 8 2 2 5 5 4 8 8 4 9 0 8 9 8 0 1 2
3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0 9 8 9
8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2 6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8
2 0 0 1 7 6 3 2 1 7 4 6 3 1 3 9 1 7 6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4 4 7 2
8 2 2 5 7 9 5 4 8 8 4 9 0 8 9 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
1 2 3 4 5 6 7 8 9 0 9 5 5 6 5 0 9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2
6 3 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8 2 0 0 1 7 6 3 2 1 7 4 6 3 1 3 9 1 7
6 8 4 3 1 4 0 5 3 6 9 6 1 7 5 4 4 7 2]
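As a standalone sanity check, the normalisation above divides by the global maximum of the data (16.0, since the digit pixels take intensities from 0 to 16), which maps every value into [0, 1]:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# pixel intensities run from 0 to 16, so dividing by the global
# maximum maps every value into the [0, 1] range
normalised = digits.data / digits.data.max()
print(normalised.min(), normalised.max())  # 0.0 1.0
```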


Now we need to do dimensionality reduction on the data:

In [2]:
from sklearn.decomposition import PCA  # also available, not used below
from sklearn.manifold import TSNE

# note: despite the variable names, t-SNE (not PCA) does the reduction here
twod_pca_data = TSNE(n_components=2, perplexity=100.0).fit_transform(data)
threed_pca_data = TSNE(n_components=3, perplexity=100.0).fit_transform(data)
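PCA is imported above but never called. As a sketch of what the same reduction would look like with PCA (which is linear and deterministic, and much faster than t-SNE on larger datasets), assuming the same normalised 500-sample slice of the digits data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
data = (digits.data / digits.data.max())[:500]

# project the 64-dimensional vectors onto the top two principal components
pca = PCA(n_components=2)
twod = pca.fit_transform(data)
print(twod.shape)  # (500, 2)

# PCA also reports how much of the original variance the projection keeps
print(pca.explained_variance_ratio_.sum())
```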


### 2D visualisation

Now we try a simple visualisation of the two-dimensional data, where every class gets its own colour.

In [4]:
import matplotlib.pyplot as plt

for label in set(digits.target_names):
    data_for_label = twod_pca_data[labels == label]
    plt.scatter(data_for_label[:, 0], data_for_label[:, 1], label=str(label))
plt.legend()
plt.tight_layout()
plt.show()
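Outside a notebook there is no `%matplotlib notebook` magic, so the plot can be written to a file instead. A minimal sketch using the non-interactive Agg backend, with random stand-in points of the same shape as the t-SNE output (the filename `digits_2d.png` is an arbitrary choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, needs no display
import matplotlib.pyplot as plt
import numpy as np

# random stand-in data with the same shape as the reduced digits data
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))
labels = rng.integers(0, 10, size=500)

for label in range(10):
    data_for_label = points[labels == label]
    plt.scatter(data_for_label[:, 0], data_for_label[:, 1], label=str(label))
plt.legend()
plt.tight_layout()
plt.savefig("digits_2d.png")  # hypothetical output filename
```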


### 3D visualisation

We can do the same, using the data that has been reduced to three dimensions, to generate a 3D plot.

This is a bit more complicated code-wise, but once you have done it, it's very easy to replicate.

In [5]:
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')
for label in set(digits.target_names):
    data_for_label = threed_pca_data[labels == label]
    ax.scatter(data_for_label[:, 0], data_for_label[:, 1], data_for_label[:, 2], label=str(label), s=300)
plt.legend()
plt.tight_layout()
plt.show()
