
Memory error during hierarchical clustering in Python 3.6


I have a fairly large data set (a 1,841,000 × 32 matrix) that I wish to run a hierarchical clustering algorithm on. Both the AgglomerativeClustering class and the FeatureAgglomeration class in sklearn.cluster give the error below.

    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    <ipython-input-10-85ab7b694cf1> in <module>()
          1
          2
    ----> 3 mat_red = manifold.SpectralEmbedding(n_components=2).fit_transform(mat)
          4 clustering.fit(mat_red,y = None)

    ~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in fit_transform(self, X, y)
        525         X_new : array-like, shape (n_samples, n_components)
        526         """
    --> 527         self.fit(X)
        528         return self.embedding_

    ~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in fit(self, X, y)
        498                               "name or a callable. Got: %s") % self.affinity)
        499
    --> 500         affinity_matrix = self._get_affinity_matrix(X)
        501         self.embedding_ = spectral_embedding(affinity_matrix,
        502                                              n_components=self.n_components,

    ~/anaconda3/lib/python3.6/site-packages/sklearn/manifold/spectral_embedding_.py in _get_affinity_matrix(self, X, Y)
        450                 self.affinity_matrix_ = kneighbors_graph(X, self.n_neighbors_,
        451                                                          include_self=True,
    --> 452                                                          n_jobs=self.n_jobs)
        453                 # currently only symmetric affinity_matrix supported
        454                 self.affinity_matrix_ = 0.5 * (self.affinity_matrix_ +

    ~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/graph.py in kneighbors_graph(X, n_neighbors, mode, metric, p, metric_params, include_self, n_jobs)
        101
        102     query = _query_include_self(X, include_self)
    --> 103     return X.kneighbors_graph(X=query, n_neighbors=n_neighbors, mode=mode)
        104
        105

    ~/anaconda3/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors_graph(self, X, n_neighbors, mode)
        482         # construct CSR matrix representation of the k-NN graph
        483         if mode == 'connectivity':
    --> 484             A_data = np.ones(n_samples1 * n_neighbors)
        485             A_ind = self.kneighbors(X, n_neighbors, return_distance=False)
        486

    ~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in ones(shape, dtype, order)
        186
        187     """
    --> 188     a = empty(shape, dtype, order)
        189     multiarray.copyto(a, 1, casting='unsafe')
        190     return a

    MemoryError:

I have 8 GB of RAM, and the same error occurred when I ran it on a 64 GB system. I realize hierarchical clustering is computationally expensive and not recommended for large datasets, but I need to create a dendrogram of all my data at once. I am creating a vocabulary tree from a bag of visual words using ORB features. If there is any other way to achieve this, or a way to fix the error, please let me know. Thank you.
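For a sense of scale, here is a rough estimate of the memory a full pairwise-distance table would need for this many rows. It is only a sketch under an assumption: that the algorithm stores all n(n-1)/2 distances as 64-bit floats, as SciPy's condensed distance format does. Whether a particular scikit-learn code path does this depends on the linkage and connectivity settings, and `condensed_distance_bytes` is a hypothetical helper, not a library function.

```python
def condensed_distance_bytes(n_samples, bytes_per_value=8):
    """Bytes needed to hold the n*(n-1)/2 pairwise distances of n_samples points."""
    return n_samples * (n_samples - 1) // 2 * bytes_per_value


n_samples = 1_841_000  # number of rows in the matrix described above
needed = condensed_distance_bytes(n_samples)

# Report the requirement in terabytes (1 TB = 1e12 bytes).
print(f"{needed / 1e12:.1f} TB")
```

Under that assumption the requirement is about 13.6 TB, far beyond both the 8 GB and the 64 GB machines, which is why subsampling or a connectivity constraint is usually needed at this size.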

Deepti Hegde asked Jul 02 '18 05:07



1 Answer

I ran into a similar issue running agglomerative clustering. My solution was to run the clustering algorithm on a small subset of the data, selected with train_test_split, and then use KNN to extend the labels from the agglomerative clustering (AC) to the rest of the data. It works reasonably well, though I am not sure whether your data is amenable to that treatment. My code for extending the labels is:

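    from sklearn.cluster import AgglomerativeClustering
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # X, y, test_size, n_clusters and n_neighbors are assumed to be defined already.
    # Keep only a small training subset that agglomerative clustering can handle.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)

    # Cluster the subset.
    AC = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    AC.fit(X_train)
    labels = AC.labels_

    # Train KNN on the cluster labels, then extend them to the full data set.
    KN = KNeighborsClassifier(n_neighbors=n_neighbors)
    KN.fit(X_train, labels)
    labels2 = KN.predict(X)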
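To try the same idea end to end, here is a minimal self-contained sketch. The synthetic data from `make_blobs`, the 10% subset size, and the values chosen for `n_clusters` and `n_neighbors` are all assumptions for illustration and would need tuning for real ORB descriptors.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption): 5,000 points, 32 features.
X, _ = make_blobs(n_samples=5000, n_features=32, centers=4, random_state=0)

# Cluster only a 10% subset, small enough for agglomerative clustering.
X_sub, X_rest = train_test_split(X, train_size=0.1, random_state=42)
ac = AgglomerativeClustering(n_clusters=4, linkage="ward")
sub_labels = ac.fit_predict(X_sub)

# Extend the subset's cluster labels to every row with nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_sub, sub_labels)
all_labels = knn.predict(X)

print(all_labels.shape, np.unique(all_labels))
```

Note that this yields flat cluster labels for every row, but the hierarchy itself is only built over the subset, so any dendrogram would describe the sampled points rather than the full data set.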
rwalroth answered Oct 04 '22 15:10
