Pandas dataframe pivot not fitting in memory

I have a dataframe df with the following structure:

        val          newidx    Code
Idx                             
0       1.0      1220121127    706
1       1.0      1220121030    706
2       1.0      1620120122    565

It has 1,000,000 rows. In total there are 600 unique Code values and 200,000 unique newidx values.

If I perform the following operation

df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')

I get a MemoryError. This seems strange, since the size of the resulting dataframe should be manageable: 200,000 x 600.

How much memory does such an operation require? Is there a way to fix this memory error?
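
For scale, a rough back-of-the-envelope calculation (assuming a dense float64 result, which is what pivot_table would produce here) puts the output alone near a gigabyte, before counting any intermediate copies:

rows, cols = 200_000, 600
print(rows * cols * 8 / 1024**3)  # ~0.89 GiB for the 200000x600 float64 result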

asked Sep 22 '16 by Ulderique Demoitre

2 Answers

Try to see if this fits in your memory:

df.groupby(['newidx', 'Code'])['val'].max().unstack()

pivot_table is unfortunately very memory-intensive, as it may make multiple copies of the data.
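
As a rough illustration, here is the same route on synthetic data shaped like the question's (the values are made up; only the row count and cardinalities match):

import numpy as np
import pandas as pd

# Synthetic stand-in: 1,000,000 rows, ~200,000 newidx values, 600 Code values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'val': np.ones(1_000_000),
    'newidx': rng.integers(0, 200_000, size=1_000_000),
    'Code': rng.integers(0, 600, size=1_000_000),
})

# Reduce to one value per (newidx, Code) pair first, then reshape.
result = df.groupby(['newidx', 'Code'])['val'].max().unstack()
print(result.shape)  # roughly (200000, 600)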


If the groupby does not work, you will have to split your DataFrame into smaller pieces. Also try not to assign intermediate results to variables multiple times. For example, if reading from a csv, chaining everything:

df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()

avoids multiple assignments, so no intermediate copy stays bound to a name.
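
If a single pass still runs out of memory, a chunked variant along these lines may help (a sketch only; the chunk size is a guess, not from the original answer): reduce each chunk as it is read, then combine the partial maxima and reshape once at the end.

import pandas as pd

# Reduce each chunk as it streams in, then combine the partial maxima.
partials = []
for chunk in pd.read_csv('file.csv', chunksize=100_000):
    partials.append(chunk.groupby(['newidx', 'Code'])['val'].max())

result = (pd.concat(partials)
            .groupby(level=['newidx', 'Code'])  # re-aggregate across chunks
            .max()
            .unstack())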

answered Sep 24 '22 by Kartik

I recently had a very similar problem when carrying out a merge between four dataframes.

What worked for me was disabling the index during the groupby, then merging.
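
A minimal sketch of that idea, with tiny made-up frames standing in for the real ones (aggregating with as_index=False keeps the group keys as ordinary columns, so the merge never builds a MultiIndex):

import pandas as pd

# Hypothetical frames, not the real data.
df = pd.DataFrame({'newidx': [1, 1, 2], 'Code': [706, 565, 706],
                   'val': [1.0, 2.0, 3.0]})
labels = pd.DataFrame({'Code': [706, 565], 'name': ['a', 'b']})

agg = df.groupby(['newidx', 'Code'], as_index=False)['val'].max()
merged = agg.merge(labels, on='Code', how='left')
print(merged)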

If @Kartik's answer doesn't work, try this before chunking the DataFrame:

df.groupby(['newidx', 'Code'], as_index=False)['val'].max().set_index(['newidx', 'Code'])['val'].unstack()

(With as_index=False the group keys come back as ordinary columns, so the index is rebuilt only on the much smaller aggregated frame before unstacking.)
answered Sep 21 '22 by mproffitt