 

MemoryError in toarray when using DictVectorizer of Scikit Learn

I am trying to run the SelectKBest algorithm on my data to pick the best features out of it. To do this I am first preprocessing the data with DictVectorizer; the data consists of 1,061,427 rows with 15 features. Each feature has many distinct values, and I believe I am getting a memory error due to this high cardinality.

I get the following error:

File "FeatureExtraction.py", line 30, in <module>
    quote_data = DV.fit_transform(quote_data).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py", line 563, in toarray
    return self.tocoo(copy=False).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/coo.py", line 233, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError

Is there any alternate way to do this? Why do I get a memory error when I am processing on a machine that has 256 GB of RAM?
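One way to sidestep the dense conversion entirely: DictVectorizer returns a scipy sparse matrix by default, and SelectKBest scorers such as chi2 accept sparse input, so the `.toarray()` call can simply be dropped. A minimal sketch under that assumption, with made-up records standing in for the real `quote_data` and labels:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical sample records; the real quote_data has 15 features.
quote_data = [{"state": "CA", "quantity": 3},
              {"state": "NY", "quantity": 5}]
labels = [0, 1]  # hypothetical target values

dv = DictVectorizer(sparse=True)      # sparse=True is the default
X = dv.fit_transform(quote_data)      # scipy sparse matrix; no .toarray()

# chi2 works on sparse matrices, so no dense copy is ever allocated.
X_best = SelectKBest(chi2, k=2).fit_transform(X, labels)
```

The dense array that `toarray()` tries to build has shape (n_rows, n_one_hot_columns), which is why a single high-cardinality column can exhaust even 256 GB.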

Any help is appreciated!

Gayatri asked Dec 07 '22 01:12

1 Answer

I figured out the problem.

When I removed a column with very high cardinality, DictVectorizer worked fine. That column had millions of distinct values, and since DictVectorizer turns every distinct string value into its own one-hot column, it was producing millions of feature columns and hence the memory error.
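The fix above can be automated: count the distinct values per key and drop any key whose cardinality exceeds a threshold before vectorizing. A small sketch with hypothetical records (`"id"` stands in for the offending high-cardinality column, and the threshold is a placeholder to tune for the real data):

```python
from collections import defaultdict

# Hypothetical records; "id" plays the role of the high-cardinality column.
records = [{"id": "a1", "state": "CA"},
           {"id": "b2", "state": "CA"},
           {"id": "c3", "state": "NY"}]

# Collect the set of distinct values seen for each key.
uniques = defaultdict(set)
for row in records:
    for key, value in row.items():
        uniques[key].add(value)

MAX_CARDINALITY = 2  # assumed threshold; pick one suited to the real data

# Keep only keys whose cardinality is within the threshold.
cleaned = [
    {k: v for k, v in row.items() if len(uniques[k]) <= MAX_CARDINALITY}
    for row in records
]
```

The `cleaned` list can then be fed to DictVectorizer as before, without the column that blows up the feature count.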

Gayatri answered Dec 08 '22 14:12