I'm having a problem with a dataset that has 400,000 rows and 300 variables. I have to create dummy variables for a categorical variable with 3,000+ distinct items, so I should end up with a dataset of roughly 3,300 variables or features that I can use to train a RandomForest model.
Here is what I've tried to do:
df = pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_')], axis=1)
When I do that I always get a memory error. Is there a limit to the number of variables I can have?
If I do it with only the first 1,000 rows (which contain 374 different categories) it works fine.
Does anyone have a solution for my problem? The computer I'm using has 8 GB of memory.
The get_dummies() function from the Pandas library converts a categorical variable into dummy/indicator variables, i.e. one-hot encoded features. Its first argument, data, is the Series or DataFrame you want to encode.
The drop_first parameter specifies whether or not to drop the first category of the categorical variable you're encoding. By default drop_first = False, so get_dummies creates one dummy variable for every level of the input categorical variable.
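As a small illustration of drop_first (a toy frame, not the asker's data): with the default, every level gets its own column; with drop_first=True the first level is omitted and is implied whenever all remaining dummies are zero.

```python
import pandas as pd

df = pd.DataFrame({'itemID': ['a', 'b', 'c', 'a']})

# Default drop_first=False: one column per level
full = pd.get_dummies(df['itemID'], prefix='itemID')
# -> columns itemID_a, itemID_b, itemID_c

# drop_first=True: the first level ('a') is dropped; a row of all
# zeros in the remaining columns implies that level
reduced = pd.get_dummies(df['itemID'], prefix='itemID', drop_first=True)
# -> columns itemID_b, itemID_c
```

With 3,000+ categories this only saves one column, so it won't fix a memory error by itself, but it does avoid the perfect collinearity among dummies.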
Update: Starting with version 0.19.0, get_dummies returns an 8-bit integer rather than a 64-bit float, which will fix this problem in many cases and make the astype solution below unnecessary. See: get_dummies -- pandas 0.19.0
But in other cases, the sparse option described below may still be helpful.
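You can verify the dtype change yourself (a quick check, not from the original answer; note that very recent pandas versions return bool rather than uint8, but either way each cell occupies a single byte):

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})
dummies = pd.get_dummies(df['itemID'], prefix='itemID')

# uint8 in pandas 0.19+, bool in pandas 2.x -- one byte per cell either way
cell_size = dummies.dtypes.iloc[0].itemsize
```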
Original Answer: Here are a couple of possibilities to try. Both will reduce the memory footprint of the dataframe substantially, but you could still run into memory issues later. It's hard to predict; you'll just have to try. (Note that I am simplifying the output of info() below.)
df = pd.DataFrame({ 'itemID': np.random.randint(1,4,100) })
pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_')], axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null float64
itemID__2 100 non-null float64
itemID__3 100 non-null float64
memory usage: 3.5 KB
Here's our baseline. Each dummy column takes up 800 bytes because the sample data has 100 rows and get_dummies appears to default to float64 (8 bytes per value). That seems like an unnecessarily inefficient way to store dummies, since a single bit per value would suffice, but there may be some reason for it that I'm not aware of.
So, first attempt: just change to a one-byte integer (this doesn't seem to be an option for get_dummies, so it has to be done as a conversion with astype(np.int8)).
pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_').astype(np.int8)],
axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null int8
itemID__2 100 non-null int8
itemID__3 100 non-null int8
memory usage: 1.5 KB
Each dummy column now takes up 1/8 the memory as before.
Alternatively, you can use the sparse option of get_dummies.
pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_',sparse=True)],
axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null float64
itemID__2 100 non-null float64
itemID__3 100 non-null float64
memory usage: 2.0 KB
Fairly comparable savings. The info() output somewhat hides how the savings occur, but you can look at the memory usage value to see the total.
Which of these will work better in practice depends on your data, so you'll just need to give them each a try (or you could even combine them).
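Combining the two ideas can be sketched as follows. Newer pandas versions (0.23+, if I recall correctly) added a dtype argument to get_dummies, so you can request sparse storage and a one-byte dtype in a single call rather than converting afterwards:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})

# Sparse storage plus a one-byte dtype in one call
dummies = pd.get_dummies(df['itemID'], prefix='itemID',
                         sparse=True, dtype=np.int8)

# Compare against the dense float64 baseline
dense_bytes = pd.get_dummies(df['itemID'], prefix='itemID',
                             dtype=np.float64).memory_usage(deep=True).sum()
sparse_bytes = dummies.memory_usage(deep=True).sum()
```

With 3,000+ categories the dummies are overwhelmingly zeros, so sparse storage is where the big win is at that scale.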