I have two data sets, train and test, containing 30213 and 30235 items respectively, each with 66 dimensions.
I am trying to apply scikit-learn's t-SNE to reduce the dimensionality to 2. Since the data sets are large and I get a MemoryError if I try to process all the data in one shot, I break them into chunks and transform one chunk at a time, like this:
import numpy as np
from sklearn import manifold

tsne = manifold.TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
X_tsne_train = np.zeros((X_train.shape[0], 2))
X_tsne_test = np.zeros((X_test.shape[0], 2))
d = ((X_train, X_tsne_train), (X_test, X_tsne_test))
chunk = 5000
for x, x_tsne in d:
    pstart, pend = 0, 0
    while pend < x.shape[0]:
        if pend + chunk < x.shape[0]:
            pend = pstart + chunk
        else:
            pend = x.shape[0]
        print('pstart =', pstart, 'pend =', pend)
        x_part = x[pstart:pend]
        x_tsne[pstart:pend] = tsne.fit_transform(x_part)
        pstart = pend
It runs without MemoryError, but I find that different runs of the script produce different outputs for the same data items. This could be because fit and transform happen together on each chunk of data. But if I try to fit on the train data with tsne.fit(X_train), I get a MemoryError. How can I correctly reduce all the items in the train and test sets to 2 dimensions without any incongruence among the chunks?
t-SNE is a nonlinear dimensionality-reduction technique that can be used when the data is very high dimensional. Dimensionality reduction is one of the important parts of unsupervised learning in data science and machine learning.
t-SNE is mostly used to understand high-dimensional data by projecting it into a low-dimensional space (like 2D or 3D). That makes it very useful, for example, for visualizing the feature representations learned by CNNs.
One of the major differences between PCA and t-SNE is that t-SNE preserves only local similarities, whereas PCA preserves large pairwise distances to maximize variance. t-SNE takes a set of points in a high-dimensional space and converts it into a low-dimensional one.
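To make the contrast concrete, here is a minimal sketch (on scikit-learn's bundled digits dataset, not the asker's data) that runs both PCA and t-SNE on the same input; both produce a 2-D embedding, but PCA's is a linear projection while t-SNE's preserves local neighborhoods:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small illustrative dataset; a subset keeps the runtime short.
X, _ = load_digits(return_X_y=True)
X = X[:300]

# PCA: linear projection onto the directions of maximum variance.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# t-SNE: nonlinear embedding that preserves local similarities.
X_tsne = TSNE(n_components=2, perplexity=30, init='pca',
              random_state=0).fit_transform(X)
```

Both arrays have shape (300, 2) and could be scatter-plotted side by side to compare the two embeddings.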
I am not entirely certain what you mean by "different outputs with the same data items", but here are some comments that might help you.
First, t-SNE is not really a "dimension reduction" technique in the same sense that PCA or other methods are. There is no way to take a fixed, learned t-SNE model and apply it to new data. (Note that the class has no transform() method, only fit() and fit_transform().) You will, therefore, be unable to use separate "train" and "test" sets.
Second, each and every time you call fit_transform() you are getting a completely different model. The meaning of your reduced dimensions is therefore not consistent from chunk to chunk: each chunk has its own little low-dimensional space. Because the model is different each time, the data are not being projected into the same space.
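One way to get all points into a single shared space (assuming the combined data fits in memory, perhaps after the PCA pre-reduction discussed below) is to stack everything into one array, call fit_transform() exactly once, and then split the embedding back apart. A minimal sketch with small stand-in arrays in place of the asker's X_train and X_test:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data; replace with the real X_train / X_test arrays.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 66)
X_test = rng.rand(150, 66)

# One combined matrix -> one model -> one shared 2-D space.
X_all = np.vstack([X_train, X_test])
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
X_all_2d = tsne.fit_transform(X_all)

# Split the shared embedding back into the original groups.
X_tsne_train = X_all_2d[:X_train.shape[0]]
X_tsne_test = X_all_2d[X_train.shape[0]:]
```

Since every row is embedded by the same fit_transform() call, the train and test points are directly comparable, which is not true when each chunk is fitted separately.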
Third, you don't include the code where you divide "train" from "test". It may be that, while you are being careful to set the random seed of t-SNE, you are not setting the random seed of your train/test division, resulting in different data divisions, and thus different results on subsequent runs.
Finally, if you want to use t-SNE to visualize your data, you might consider following the advice on the documentation page, and applying PCA to reduce the dimensionality of the input from 66 to, say, 15. That would dramatically reduce the memory footprint of t-SNE.
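A sketch of that PCA-then-t-SNE pipeline, again with a small random stand-in array (the 15-component figure is just the illustrative number mentioned above, not a tuned value):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the real 66-dimensional data.
rng = np.random.RandomState(0)
X = rng.rand(300, 66)

# Step 1: PCA shrinks 66 dimensions down to 15, cutting t-SNE's
# memory and compute cost substantially.
X_reduced = PCA(n_components=15, random_state=0).fit_transform(X)

# Step 2: t-SNE embeds the PCA-reduced data into 2-D for visualization.
X_2d = TSNE(n_components=2, perplexity=30, init='pca',
            random_state=0).fit_transform(X_reduced)
```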
TSNE in SKLearn Docs