I have two data sets, train and test, containing 30213 and 30235 items respectively, each with 66 dimensions.
I am trying to apply scikit-learn's t-SNE to reduce the dimensionality to 2. Since the data sets are large and I get a MemoryError if I try to process all the data in one shot, I break them into chunks and transform one chunk at a time, like this:
import numpy as np
from sklearn import manifold

tsne = manifold.TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
X_tsne_train = np.zeros((X_train.shape[0], 2))
X_tsne_test = np.zeros((X_test.shape[0], 2))
d = ((X_train, X_tsne_train), (X_test, X_tsne_test))
chunk = 5000
for x, x_tsne in d:
    pstart, pend = 0, 0
    while pend < x.shape[0]:
        if pend + chunk < x.shape[0]:
            pend = pstart + chunk
        else:
            pend = x.shape[0]
        print('pstart =', pstart, 'pend =', pend)
        x_part = x[pstart:pend]
        x_tsne[pstart:pend] = tsne.fit_transform(x_part)
        pstart = pend
It runs without MemoryError, but I find that different runs of the script produce different outputs for the same data items. This could be because fit and transform happen together on each chunk of data. But if I try to fit on the train data with tsne.fit(X_train), I get a MemoryError. How can I correctly reduce all the items in the train and test sets to 2 dimensions without any incongruence among the chunks?
t-SNE is a nonlinear dimensionality-reduction technique that can be used when the data is very high dimensional. Dimensionality reduction is one of the important parts of unsupervised learning in data science and machine learning.
t-SNE is mostly used to understand high-dimensional data by projecting it into a low-dimensional space (like 2D or 3D). That makes it very useful, for example, for visualizing the feature representations learned by CNNs.
One of the major differences between PCA and t-SNE is that t-SNE preserves only local similarities, whereas PCA preserves large pairwise distances to maximize variance. t-SNE takes a set of points in a high-dimensional space and converts it into a low-dimensional one.
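To make the contrast concrete, here is a minimal sketch (on scikit-learn's bundled digits dataset, not the asker's data) that runs both PCA and t-SNE on the same input; both produce a 2-D embedding, but PCA's is a linear projection while t-SNE's preserves local neighborhoods:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small illustrative dataset; a subset keeps the runtime short.
X, _ = load_digits(return_X_y=True)
X = X[:300]

# PCA: linear projection onto the directions of maximum variance.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# t-SNE: nonlinear embedding that preserves local similarities.
X_tsne = TSNE(n_components=2, perplexity=30, init='pca',
              random_state=0).fit_transform(X)
```

Both arrays have shape (300, 2) and could be scatter-plotted side by side to compare the two embeddings.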
I am not entirely certain what you mean by "different outputs with the same data items", but here are some comments that might help you.
First, t-SNE is not really a "dimension reduction" technique in the same sense that PCA or other methods are. There is no way to take a fixed, learned t-SNE model and apply it to new data. (Note that the class has no transform() method, only fit() and fit_transform().) You will, therefore, be unable to use separate "train" and "test" sets.
Second, each and every time you call fit_transform() you are getting a completely different model. The meaning of your reduced dimensions is therefore not consistent from chunk to chunk: each chunk has its own little low-dimensional space. Because the model is different each time, the data are not being projected into the same space.
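One way to get all points into a single shared space (assuming the combined data fits in memory, perhaps after the PCA pre-reduction discussed below) is to stack everything into one array, call fit_transform() exactly once, and then split the embedding back apart. A minimal sketch with small stand-in arrays in place of the asker's X_train and X_test:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data; replace with the real X_train / X_test arrays.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 66)
X_test = rng.rand(150, 66)

# One combined matrix -> one model -> one shared 2-D space.
X_all = np.vstack([X_train, X_test])
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
X_all_2d = tsne.fit_transform(X_all)

# Split the shared embedding back into the original groups.
X_tsne_train = X_all_2d[:X_train.shape[0]]
X_tsne_test = X_all_2d[X_train.shape[0]:]
```

Since every row is embedded by the same fit_transform() call, the train and test points are directly comparable, which is not true when each chunk is fitted separately.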
Third, you don't include the code where you divide "train" from "test". It may be that, while you are being careful to set the random seed of t-SNE, you are not setting the random seed of your train/test division, resulting in different data divisions, and thus different results on subsequent runs.
Finally, if you want to use t-SNE to visualize your data, you might consider following the advice on the documentation page, and applying PCA to reduce the dimensionality of the input from 66 to, say, 15. That would dramatically reduce the memory footprint of t-SNE.
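A sketch of that PCA-then-t-SNE pipeline, again with a small random stand-in array (the 15-component figure is just the illustrative number mentioned above, not a tuned value):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the real 66-dimensional data.
rng = np.random.RandomState(0)
X = rng.rand(300, 66)

# Step 1: PCA shrinks 66 dimensions down to 15, cutting t-SNE's
# memory and compute cost substantially.
X_reduced = PCA(n_components=15, random_state=0).fit_transform(X)

# Step 2: t-SNE embeds the PCA-reduced data into 2-D for visualization.
X_2d = TSNE(n_components=2, perplexity=30, init='pca',
            random_state=0).fit_transform(X_reduced)
```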
TSNE in SKLearn Docs