I am trying to run SelectKBest on my data to pick out the best features. As a first step I preprocess the data with DictVectorizer; the data consists of 1,061,427 rows with 15 features. Each feature takes many distinct values, and I believe the memory error is caused by this high cardinality.
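For context, here is a minimal sketch of the setup. The names DV and quote_data come from the traceback below; the sample records are made up:

from sklearn.feature_extraction import DictVectorizer

# Each record is a dict of feature name -> value; string values are
# one-hot encoded, so a high-cardinality column produces one output
# column per distinct value.
quote_data = [
    {"feature_a": "x", "feature_b": 1.0},
    {"feature_a": "y", "feature_b": 2.0},
]
DV = DictVectorizer(sparse=True)   # sparse=True is the default
X = DV.fit_transform(quote_data)   # fine: result stays a scipy sparse matrix
X_dense = X.toarray()              # this densification step is what fails at scale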
I get the following error:
File "FeatureExtraction.py", line 30, in <module>
quote_data = DV.fit_transform(quote_data).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py", line 563, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/coo.py", line 233, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
Is there an alternate way to do this? And why do I get a memory error on a machine that has 256 GB of RAM?
Any help is appreciated!
I figured out the problem.
When I removed a column that had very high cardinality, DictVectorizer worked fine. That column had millions of distinct unique values, so one-hot encoding it produced millions of output columns. The dense array that .toarray() asks NumPy to allocate (the np.zeros(self.shape) call in the traceback) would then need roughly 1,061,427 rows × 1,000,000 columns × 8 bytes per float64 ≈ 8.5 TB per million columns, which is far beyond 256 GB of RAM, hence the MemoryError.
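Since the question also asks for an alternate way: the DictVectorizer output is already a scipy sparse matrix by default, and SelectKBest with the chi2 score function accepts sparse input directly, so simply dropping the .toarray() call avoids building the dense array at all. A minimal sketch (the sample data, labels, and k are made up):

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Made-up sample records and labels for illustration.
quote_data = [
    {"feature_a": "x", "feature_b": 1.0},
    {"feature_a": "y", "feature_b": 2.0},
    {"feature_a": "x", "feature_b": 3.0},
]
y = [0, 1, 0]

DV = DictVectorizer(sparse=True)   # default; yields a scipy sparse matrix
X = DV.fit_transform(quote_data)   # no .toarray(): nothing is densified

# chi2 requires non-negative features, which one-hot encoded
# categoricals satisfy, and it works on sparse input.
selector = SelectKBest(chi2, k=2)
X_best = selector.fit_transform(X, y)
print(X_best.shape)

If the high-cardinality column genuinely has to stay, another option is sklearn.feature_extraction.FeatureHasher, which caps the output dimensionality at a fixed size at the cost of hash collisions.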