Goal: I aim to use t-SNE (t-distributed Stochastic Neighbor Embedding) in R for dimensionality reduction of my training data (with N observations and K variables, where K>>N) and subsequently aim to come up with the t-SNE representation for my test data.
Example: Suppose I aim to reduce the K variables to D=2 dimensions (often, D=2 or D=3 for t-SNE). There are two R packages: Rtsne
and tsne
, while I use the former here.
# load packages
library(Rtsne)
# Generate Training Data: random standard normal matrix with J=400 variables and N=100 observations
x.train <- matrix(nrom(n=40000, mean=0, sd=1), nrow=100, ncol=400)
# Generate Test Data: random standard normal vector with N=1 observation for J=400 variables
x.test <- rnorm(n=400, mean=0, sd=1)
# perform t-SNE
set.seed(1)
fit.tsne <- Rtsne(X=x.train, dims=2)
where the command fit.tsne$Y
will return the (100x2)-dimensional object containing the t-SNE representation of the data; can also be plotted via plot(fit.tsne$Y)
.
Problem: Now, what I am looking for is a function that returns a prediction pred
of dimension (1x2) for my test data based on the trained t-SNE model. Something like,
# The function I am looking for (but doesn't exist yet):
pred <- predict(object=fit.tsne, newdata=x.test)
(How) Is this possible? Can you help me out with this?
t-SNE does not really work this way:
The following is an expert from the t-SNE author's website (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
You may be interested in his paper: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf
This website in addition to being really cool offers a wealth of info about t-SNE: http://distill.pub/2016/misread-tsne/
On Kaggle I have also seen people do things like this which may also be of intrest: https://www.kaggle.com/cherzy/d/dalpozz/creditcardfraud/visualization-on-a-2d-map-with-t-sne
This the mail answer from the author (Jesse Krijthe) of the Rtsne package:
Thank you for the very specific question. I had an earlier request for this and it is noted as an open issue on GitHub (https://github.com/jkrijthe/Rtsne/issues/6). The main reason I am hesitant to implement something like this is that, in a sense, there is no 'natural' way explain what a prediction means in terms of tsne. To me, tsne is a way to visualize a distance matrix. As such, a new sample would lead to a new distance matrix and hence a new visualization. So, my current thinking is that the only sensible way would be to rerun the tsne procedure on the train and test set combined.
Having said that, other people do think it makes sense to define predictions, for instance by keeping the train objects fixed in the map and finding good locations for the test objects (as was suggested in the issue). An approach I would personally prefer over this would be something like parametric tsne, which Laurens van der Maaten (the author of the tsne paper) explored a paper. However, this would best be implemented using something else than my package, because the parametric model is likely most effective if it is selected by the user.
So my suggestion would be to 1) refit the mapping using all data or 2) see if you can find an implementation of parametric tsne, the only one I know of would be Laurens's Matlab implementation.
Sorry I can not be of more help. If you come up with any other/better solutions, please let me know.
From the author himself (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf).
So you can't directly apply new data points. However, you can fit a multivariate regression model between your data and the embedded dimensions. The author recognizes that it's a limitation of the method and suggests this way to get around it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With