 

MemoryError in toarray when using DictVectorizer of Scikit Learn

I am trying to run the SelectKBest algorithm on my data to pick the best features out of it. To do this I am first preprocessing the data with DictVectorizer; the data consists of 1,061,427 rows with 15 features. Each feature has many distinct values, and I believe I am getting a memory error due to this high cardinality.

I get the following error:

File "FeatureExtraction.py", line 30, in <module>
    quote_data = DV.fit_transform(quote_data).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py", line 563, in toarray
    return self.tocoo(copy=False).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/coo.py", line 233, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError

Is there any alternate way to do this? Why do I get a memory error when I am processing on a machine that has 256 GB of RAM?
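One way to sidestep the dense conversion entirely: DictVectorizer returns a scipy sparse matrix by default, and SelectKBest scorers such as chi2 accept sparse input, so the `.toarray()` call can simply be dropped. A minimal sketch under that assumption, with made-up records standing in for the real `quote_data` and labels:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical sample records; the real quote_data has 15 features.
quote_data = [{"state": "CA", "quantity": 3},
              {"state": "NY", "quantity": 5}]
labels = [0, 1]  # hypothetical target values

dv = DictVectorizer(sparse=True)      # sparse=True is the default
X = dv.fit_transform(quote_data)      # scipy sparse matrix; no .toarray()

# chi2 works on sparse matrices, so no dense copy is ever allocated.
X_best = SelectKBest(chi2, k=2).fit_transform(X, labels)
```

The dense array that `toarray()` tries to build has shape (n_rows, n_one_hot_columns), which is why a single high-cardinality column can exhaust even 256 GB.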

Any help is appreciated!

Gayatri asked Dec 07 '22 01:12

1 Answer

I figured out the problem.

When I removed a column with very high cardinality, DictVectorizer worked fine. That column had millions of distinct values, and since DictVectorizer turns every distinct string value into its own one-hot column, it was producing millions of feature columns and hence the memory error.
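The fix above can be automated: count the distinct values per key and drop any key whose cardinality exceeds a threshold before vectorizing. A small sketch with hypothetical records (`"id"` stands in for the offending high-cardinality column, and the threshold is a placeholder to tune for the real data):

```python
from collections import defaultdict

# Hypothetical records; "id" plays the role of the high-cardinality column.
records = [{"id": "a1", "state": "CA"},
           {"id": "b2", "state": "CA"},
           {"id": "c3", "state": "NY"}]

# Collect the set of distinct values seen for each key.
uniques = defaultdict(set)
for row in records:
    for key, value in row.items():
        uniques[key].add(value)

MAX_CARDINALITY = 2  # assumed threshold; pick one suited to the real data

# Keep only keys whose cardinality is within the threshold.
cleaned = [
    {k: v for k, v in row.items() if len(uniques[k]) <= MAX_CARDINALITY}
    for row in records
]
```

The `cleaned` list can then be fed to DictVectorizer as before, without the column that blows up the feature count.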

Gayatri answered Dec 08 '22 14:12