 

Jupyter notebook kernel dies when creating dummy variables with pandas

I am working on the Walmart Kaggle competition and I'm trying to create dummy columns from the "FinelineNumber" column. For context, df.shape returns (647054, 7). df['FinelineNumber'] has 5,196 unique values, so the result should be a dataframe of shape (647054, 5196), which I then plan to concat to the original dataframe.

Nearly every time I run fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl'), I get the following error message: "The kernel appears to have died. It will restart automatically." I am running Python 2.7 in Jupyter Notebook on a MacBook Pro with 16 GB of RAM.

Can someone explain why this is happening (and why it happens most of the time but not every time)? Is it a Jupyter Notebook or pandas bug? Also, I thought it might have to do with not enough RAM, but I get the same error on a Microsoft Azure Machine Learning notebook with >100 GB of RAM. On Azure ML, the kernel dies every time, almost immediately.

blahblahblah asked Dec 02 '15

1 Answer

It very much could be memory usage: a 647054 × 5196 data frame has 3,362,092,584 elements, which at 8 bytes each would be roughly 25 GB just for the pointers to the objects on a 64-bit system. On AzureML, while the VM has a large amount of memory, you're actually limited in how much memory you have available (currently 2 GB, soon to be 4 GB), and when you hit the limit the kernel typically dies. So it seems very likely it is a memory usage issue.
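As a quick back-of-the-envelope check of that estimate (8 bytes per element is the size of a float64 value, or of an object pointer, on a 64-bit system):

rows, cols = 647054, 5196
elements = rows * cols           # 3,362,092,584 elements
approx_bytes = elements * 8      # 8 bytes per element (float64 / pointer)
print(approx_bytes / 1024**3)    # ~25 GiB, before any other overhead

That is well past the 16 GB on the MacBook Pro and far past the per-kernel quota on AzureML, which matches the kernel dying almost immediately there.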

You might try doing .to_sparse() on the data frame first before doing any additional manipulations. That should allow Pandas to keep most of the data frame out of memory.
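A minimal sketch of the sparse approach, assuming pandas of that era (to_sparse() was removed in pandas 1.0). Note that get_dummies itself also accepts a sparse=True flag, which avoids ever materializing the dense indicator matrix; the df below is a toy stand-in for the question's dataframe so the snippet runs on its own:

import pandas as pd

# Toy stand-in for the competition dataframe from the question.
df = pd.DataFrame({'FinelineNumber': [1000, 1000, 2731, 4606]})

# sparse=True makes get_dummies return sparse-backed columns, so the
# mostly-zero dummy matrix stays compact instead of dense float64.
fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl', sparse=True)
df = pd.concat([df, fineline_dummies], axis=1)
print(df.dtypes)  # the fl_* columns use a sparse dtype

Since most rows are zero in every dummy column, the sparse representation only stores the nonzero entries, which is what keeps the frame small.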

Dino Viehland answered Nov 02 '22