I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited ammount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd import numpy as np import random from pandas.io.pytables import * import os pd.set_option('io.hdf.default_format','table') # create a small dataframe to simulate the real data. def loadFrame(): frame = pd.DataFrame() frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test return frame def pivotSegment(segmentNumber,passedFrame): segmentSize = 3 #take 3 rows at a time frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values span = pd.DataFrame() span['catid'] = range(1,5+1) span['shipmentid']=1 span['qty']=0 frame = frame.append(span) return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \ aggfunc='sum',fill_value=0).reset_index() def createStore(): store = pd.HDFStore('testdata.h5') return store segMin = 0 segMax = 4 store = createStore() frame = loadFrame() print('Printing Frame') print(frame) print(frame.info()) for i in range(segMin,segMax): segment = pivotSegment(i,frame) store.append('data',frame[(i*3):(i*3 + 3)]) store.append('pivotedData',segment) print('\nPrinting Store') print(store) print('\nPrinting Store: data') print(store['data']) print('\nPrinting Store: pivotedData') print(store['pivotedData']) print('**************') print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum()) print('**************') print('$$$') for df in store.select('pivotedData',chunksize=3): print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum()) print('$$$') store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3))) print('\nPrinting Store: pivotedAndSummed') print(store['pivotedAndSummed']) store.close() os.remove('testdata.h5') print('closed')
The long answer is the size limit for pandas DataFrames is 100 gigabytes (GB) of memory instead of a set number of cells.
The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
Pivot tables are one of Excel's most powerful features. A pivot table allows us to draw insights from data. Pandas provides a similar function called pivot_table() . Pandas pivot_table() is a simple function but can produce very powerful analysis very quickly.
Pivot table in pandas is an excellent tool to summarize one or more numeric variable based on two other categorical variables. Pivot tables in pandas are popularly seen in MS Excel files. In python, Pivot tables of pandas dataframes can be created using the command: pandas. pivot_table .
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5') for ...: ... chunk # the chunk of the DataFrame (which you want to append) store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below sum(df.groupby().sum() for df in store.select('df', chunksize=50000)) # equivalent to (but doesn't read in the entire frame) store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2), instead you can use reduce
with add
:
reduce(lambda x, y: x.add(y, fill_value=0), (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000)) res = next(chunks) # will raise if there are no chunks! for c in chunks: res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With