How to create a pivot table on extremely large dataframes in Pandas

Tags:

I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.

How can I do a pivot on data this large with a limited ammount of RAM?

EDIT: adding sample code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.

import pandas as pd import numpy as np import random from pandas.io.pytables import * import os  pd.set_option('io.hdf.default_format','table')   # create a small dataframe to simulate the real data. def loadFrame():     frame = pd.DataFrame()     frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes     frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test     frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test     return frame  def pivotSegment(segmentNumber,passedFrame):     segmentSize = 3 #take 3 rows at a time     frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF      # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values     span = pd.DataFrame()      span['catid'] = range(1,5+1)     span['shipmentid']=1     span['qty']=0      frame = frame.append(span)      return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \                              aggfunc='sum',fill_value=0).reset_index()  def createStore():      store = pd.HDFStore('testdata.h5')     return store  segMin = 0 segMax = 4  store = createStore() frame = loadFrame()  print('Printing Frame') print(frame) print(frame.info())  for i in range(segMin,segMax):     segment = pivotSegment(i,frame)     store.append('data',frame[(i*3):(i*3 + 3)])     store.append('pivotedData',segment)  print('\nPrinting Store')    print(store) print('\nPrinting Store: data')  print(store['data']) print('\nPrinting Store: pivotedData')  print(store['pivotedData'])  print('**************') print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum()) print('**************') print('$$$') for df in store.select('pivotedData',chunksize=3):     print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())  print('$$$') store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3))) print('\nPrinting Store: pivotedAndSummed')  print(store['pivotedAndSummed'])  store.close() os.remove('testdata.h5') print('closed')

659

asked Apr 03 '15 20:04

TraxusIV

1 Answers

You could do the appending with HDF5/pytables. This keeps it out of RAM.

Use the table format:

store = pd.HDFStore('store.h5') for ...:     ...     chunk  # the chunk of the DataFrame (which you want to append)     store.append('df', chunk)

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query, to get only subsections of the DataFrame.

Aside: You should also buy more RAM, it's cheap.

Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:

# note: this doesn't work, see below sum(df.groupby().sum() for df in store.select('df', chunksize=50000)) # equivalent to (but doesn't read in the entire frame) store['df'].groupby().sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2), instead you can use reduce with add:

reduce(lambda x, y: x.add(y, fill_value=0),        (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In python 3 you must import reduce from functools.

Perhaps it's more pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000)) res = next(chunks)  # will raise if there are no chunks! for c in chunks:     res = res.add(c, fill_value=0)

If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.

154

answered Oct 07 '22 18:10

Andy Hayden

Related questions
                            
                                How to Edit and Continue in ASP.Net MVC 6
                            
                                Postgres column does not exist but it's there with alias [duplicate]
                            
                                How to get available wifi network name in iOS using swift
                            
                                Is there an API for the Google Answer Boxes?
                            
                                async-await's continuations bursts — behave differently?
                            
                                Primitive types as objects in c++
                            
                                What is the correct way to load CSS staticfiles using npm and Django?
                            
                                Prevent Hibernate from deleting orphaned entities while merging an entity having entity associations with orphanRemoval set to true
                            
                                Handling errors in Swift
                            
                                Basic proxy authentication for HTTPS URLs returns HTTP/1.0 407 Proxy Authentication Required
                            
                                What is the sequence for tagging commit in git?
                            
                                Nginx to serve static page before dynamic

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With