
How to create a pivot table on extremely large dataframes in Pandas


I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames with .append() followed by .groupby('someKey').sum(), all my memory is used up and Python eventually crashes.

How can I do a pivot on data this large with a limited amount of RAM?

EDIT: adding sample code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code will appear to produce correct output, which is a false positive. The main issue is that if a shipmentid entry does not appear in every chunk that the final sum(...) aggregation looks at, it doesn't show up in the output.

import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format','table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0

    frame = frame.append(span)

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid',
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())

print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
asked Apr 03 '15 by TraxusIV

People also ask

Is there a size limit for pandas DataFrame?

The practical limit for a pandas DataFrame is the amount of memory available to hold it, rather than a set number of rows or cells.

Is pandas efficient for large data sets?

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
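As a small illustration of that point (the column name and values here are made up for the example), converting a repetitive text column to the categorical dtype can cut its memory use dramatically:

import pandas as pd

# hypothetical frame with a low-cardinality text column
df = pd.DataFrame({'catname': ['food', 'tools', 'toys'] * 100000})

print(df.memory_usage(deep=True))            # object dtype: every string stored separately

df['catname'] = df['catname'].astype('category')
print(df.memory_usage(deep=True))            # category dtype: small integer codes plus one lookup table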

What is the difference between pivot and pivot_table in pandas?

Pivot tables are one of Excel's most powerful features, and pandas provides a similar function called pivot_table(). The difference from pivot() is that pivot() only reshapes the data and raises an error if an index/column pair occurs more than once, whereas pivot_table() aggregates duplicate entries (with sum, mean, etc.). pivot_table() is a simple function but can produce very powerful analysis very quickly.

What is pivot_table in pandas?

A pivot table in pandas is an excellent tool to summarize one or more numeric variables based on two other categorical variables. Pivot tables are popularly seen in MS Excel files. In Python, pivot tables of pandas DataFrames can be created with pandas.pivot_table() (or the DataFrame.pivot_table() method), as in the short example below.
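A minimal example (the shipmentid/catid/qty names follow the question's sample data):

import pandas as pd

df = pd.DataFrame({'shipmentid': [1, 1, 2, 2],
                   'catid':      [1, 2, 1, 3],
                   'qty':        [5, 3, 2, 7]})

# one row per shipmentid, one column per catid, with qty summed into each cell
print(pd.pivot_table(df, values='qty', index='shipmentid',
                     columns='catid', aggfunc='sum', fill_value=0))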


1 Answer

You could do the appending with HDF5/pytables. This keeps it out of RAM.

Use the table format:

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
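For instance, a fuller sketch along those lines for the question's data might look like the following. The CSV file name, the chunked read, and the assumption that all ~2000 category ids are known up front are mine; the column names come from the question:

import pandas as pd

all_catids = list(range(1, 2001))  # assumption: the ~2000 possible catid values are known in advance

store = pd.HDFStore('pivoted.h5')
for chunk in pd.read_csv('shipments.csv', chunksize=100000):  # 'shipments.csv' is a hypothetical source file
    pivoted = chunk.pivot_table(values='qty', index='shipmentid', columns='catid',
                                aggfunc='sum', fill_value=0)
    # give every chunk the same column layout so rows can be appended to one table
    pivoted = pivoted.reindex(columns=all_catids, fill_value=0).reset_index()
    store.append('pivotedData', pivoted)
store.close()

Note that a shipmentid spanning two chunks will appear as two rows in the stored table, which is why the groupby/sum step described below is still needed.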

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df'] 

You can also query, to get only subsections of the DataFrame.
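For example, something along these lines, assuming the column you want to filter on was declared as a data column when appending (the shipmentid name mirrors the question, and 1234 is an arbitrary example value):

# when appending, make the column queryable:
# store.append('df', chunk, data_columns=['shipmentid'])

# later, read back only the matching rows instead of the whole table
subset = store.select('df', where='shipmentid == 1234')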

Aside: You should also buy more RAM; it's cheap.


Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In Python 3 you must import reduce from functools.
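In other words, the only change under Python 3 is one extra import at the top; the reduce call itself stays the same:

from functools import reduce  # reduce was removed from the builtins in Python 3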

Perhaps it's more pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

If performance is poor, or if there are a large number of new groups, it may be preferable to start res as zeros of the correct size (by first collecting the unique group keys, e.g. by looping through the chunks) and then add in place, as sketched below.
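A rough sketch of that two-pass idea (untested; 'someKey' stands in for the real grouping column, as in the question, and the chunk size is arbitrary):

import pandas as pd

# first pass: collect every group key cheaply, reading only the key column per chunk
keys = set()
for df in store.select('df', chunksize=50000, columns=['someKey']):
    keys.update(df['someKey'].unique())

# second pass: pre-allocate a zero result of the final shape and add partial sums in place
res = None
for df in store.select('df', chunksize=50000):
    partial = df.groupby('someKey').sum()
    if res is None:
        # use 0.0 instead of 0 here if the summed values are floats
        res = pd.DataFrame(0, index=sorted(keys), columns=partial.columns)
    res.loc[partial.index, partial.columns] += partial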

answered Oct 07 '22 by Andy Hayden