
Scaling t-SNE to millions of observations in scikit-learn

t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the scikit-learn implementation.

I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.

t-SNE computes and stores the full, dense similarity matrix for the input observations (I've confirmed this by looking at the source). In my case, that is a 10-billion-element dense matrix, which by itself requires 80+ GB of memory. Extrapolate to just one million observations and you're looking at 8 terabytes of RAM just to store the distance matrix (never mind the computation time...).
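A quick back-of-the-envelope check of those memory figures, assuming float64 entries (8 bytes each) in a dense n-by-n matrix:

```python
def dense_matrix_gb(n_samples, bytes_per_entry=8):
    """Memory in GB for a dense n x n matrix of float64 entries."""
    return n_samples ** 2 * bytes_per_entry / 1e9

print(dense_matrix_gb(100_000))    # 80.0 GB for ~100k observations
print(dense_matrix_gb(1_000_000))  # 8000.0 GB, i.e. ~8 TB for 1M observations
```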

So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:

By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
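For reference, the two gradient methods the docs describe correspond to the `method` parameter of `sklearn.manifold.TSNE` (a minimal sketch; all other parameters left at their defaults):

```python
from sklearn.manifold import TSNE

tsne_bh = TSNE(method='barnes_hut')  # O(N log N) approximation (the default)
tsne_exact = TSNE(method='exact')    # exact gradients, O(N^2) time
print(tsne_bh.method, tsne_exact.method)
```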

That's my emphasis, but I would certainly read that as implying that the Barnes-Hut method can scale to millions of examples. Yet, as I said, the code computes the full distance matrix well before we even get to any of the actual t-SNE transformations (with or without Barnes-Hut).

So am I missing something? Is it possible to scale this up to millions of datapoints?

asked May 26 '16 by moustachio


1 Answer

Barnes-Hut does NOT require you to compute and store the full, dense similarity matrix for the input observations.

Also, take a look at the references mentioned in the documentation. In particular, this one. Quoting that page:

The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
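As a hedged sketch of what that means in practice: more recent scikit-learn releases build the joint probabilities from only a small number of nearest neighbours per point (on the order of 3 * perplexity, to my understanding) when method='barnes_hut', so the dense N x N matrix is never materialised. Toy-sized synthetic data below so it runs quickly; scale n_samples as your memory allows:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 190))  # stand-in for the asker's ~100k x 190 data

# Optional first-pass dimensionality reduction, as mentioned in the question
X_reduced = PCA(n_components=50).fit_transform(X)

# Barnes-Hut t-SNE: neighbour-limited P-matrix, O(N log N) gradient
embedding = TSNE(n_components=2, method='barnes_hut', perplexity=30,
                 init='pca', random_state=0).fit_transform(X_reduced)
print(embedding.shape)  # (500, 2)
```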

That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.

answered Jan 12 '23 by Qusai Alothman