Pandas - Slice large dataframe into chunks

Tags: python, slice, pandas, dataframe

I have a large dataframe (more than 3 million rows) that I'm trying to pass through a function (the one below is heavily simplified), and I keep getting a MemoryError.

I think I'm passing too large a dataframe into the function, so I'm trying to:

1) Slice the dataframe into smaller chunks (preferably sliced by AcctName)

2) Pass each chunk into the function

3) Concatenate the results back into one large dataframe

def trans_times_2(df):
    df['Double_Transaction'] = df['Transaction'] * 2

large_df 
AcctName   Timestamp    Transaction
ABC        12/1         12.12
ABC        12/2         20.89
ABC        12/3         51.93    
DEF        12/2         13.12
DEF        12/8          9.93
DEF        12/9         92.09
GHI        12/1         14.33
GHI        12/6         21.99
GHI        12/12        98.81

I know that my function works properly, since it works on a smaller dataframe (e.g. 40,000 rows). I tried the following, but I could not concatenate the small dataframes back into one large dataframe.

def split_df(df):
    new_df = []
    AcctNames = df.AcctName.unique()
    DataFrameDict = {elem: pd.DataFrame for elem in AcctNames}
    key_list = [k for k in DataFrameDict.keys()]
    new_df = []
    for key in DataFrameDict.keys():
        DataFrameDict[key] = df[:][df.AcctNames == key]
        trans_times_2(DataFrameDict[key])
    rejoined_df = pd.concat(new_df)

How I envision the dataframes being split:

df1
AcctName   Timestamp    Transaction  Double_Transaction
ABC        12/1         12.12        24.24
ABC        12/2         20.89        41.78
ABC        12/3         51.93        103.86

df2
AcctName   Timestamp    Transaction  Double_Transaction
DEF        12/2         13.12        26.24
DEF        12/8          9.93        19.86
DEF        12/9         92.09        184.18

df3
AcctName   Timestamp    Transaction  Double_Transaction
GHI        12/1         14.33        28.66
GHI        12/6         21.99        43.98
GHI        12/12        98.81        197.62

asked Jun 23 '17 20:06

Walt Reed

3 Answers

You can use a list comprehension to split your dataframe into smaller dataframes contained in a list.

n = 200000  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

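Or use numpy's array_split. Note that its second argument is the number of chunks, not the chunk row size, so convert first:

import math

list_df = np.array_split(df, math.ceil(len(df) / n))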
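To make the array_split semantics concrete, here is a minimal sketch on a 10-row toy dataframe. The names chunk_rows, n_sections and bounds are mine, not from the answer. As far as I know, passing a DataFrame straight to np.array_split can emit a deprecation warning on some recent pandas versions, so this variant splits the row positions and slices with iloc instead; treat that as one possible workaround rather than the required approach.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"Transaction": range(10)})
chunk_rows = 4  # desired maximum rows per chunk (illustrative value)

# array_split takes the number of sections, so derive it from the chunk size.
n_sections = math.ceil(len(df) / chunk_rows)

# Split the row positions 0..9 into nearly equal sections.
bounds = np.array_split(np.arange(len(df)), n_sections)

# Slice the frame by position using the first and last position of each section.
list_df = [df.iloc[idx[0]:idx[-1] + 1] for idx in bounds]

print([len(chunk) for chunk in list_df])  # [4, 3, 3]
```

Unlike the list comprehension, array_split balances the chunk sizes, so the last chunk is not a small leftover.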

You can access the chunks with:

list_df[0]
list_df[1]
# etc.

Then you can assemble the chunks back into one dataframe using pd.concat.
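Putting the pieces together, the whole round trip might look like the sketch below. One assumption to flag: the trans_times_2 in the question modifies its argument and returns nothing, so here I use a variant that returns a modified copy, which makes the results easy to collect. The chunk size of 4 is only there to keep the example small.

```python
import pandas as pd


def trans_times_2(df):
    # Assumed variant of the question's function: work on a copy so the
    # chunk (a slice of the original) is not modified, and return the result.
    out = df.copy()
    out["Double_Transaction"] = out["Transaction"] * 2
    return out


# Small stand-in for the large dataframe in the question.
large_df = pd.DataFrame({
    "AcctName": ["ABC", "ABC", "ABC", "DEF", "DEF", "DEF", "GHI", "GHI", "GHI"],
    "Timestamp": ["12/1", "12/2", "12/3", "12/2", "12/8", "12/9", "12/1", "12/6", "12/12"],
    "Transaction": [12.12, 20.89, 51.93, 13.12, 9.93, 92.09, 14.33, 21.99, 98.81],
})

n = 4  # chunk row size (tiny here; the answer uses 200000)
list_df = [large_df[i:i + n] for i in range(0, large_df.shape[0], n)]

# Process each chunk separately, then stack the results row-wise.
rejoined_df = pd.concat([trans_times_2(chunk) for chunk in list_df])

print(rejoined_df.shape)  # (9, 4)
```

Because each chunk keeps its original index labels, the rejoined dataframe has the same index as large_df.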

To split by AcctName instead:

list_df = []

# groupby yields (group key, sub-dataframe) pairs, one per account
for name, group in df.groupby('AcctName'):
    list_df.append(group)
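For the per-account case the question asks about, a minimal sketch could look like this. Again, trans_times_2 here is an assumed variant that returns a copy, and the interleaved sample rows are made up to show what happens to row order.

```python
import pandas as pd


def trans_times_2(df):
    # Assumed variant of the question's function that returns a modified copy.
    out = df.copy()
    out["Double_Transaction"] = out["Transaction"] * 2
    return out


# Accounts are deliberately interleaved to show the effect on row order.
large_df = pd.DataFrame({
    "AcctName": ["ABC", "DEF", "ABC", "GHI", "DEF"],
    "Transaction": [12.12, 13.12, 20.89, 14.33, 9.93],
})

# One sub-dataframe per account; sort=False keeps groups in order of first appearance.
processed = [
    trans_times_2(group)
    for _, group in large_df.groupby("AcctName", sort=False)
]

# After concat the rows are grouped by account; sort_index() restores
# the original row order because each group kept its index labels.
rejoined_df = pd.concat(processed).sort_index()

print(rejoined_df["AcctName"].tolist())  # ['ABC', 'DEF', 'ABC', 'GHI', 'DEF']
```

If the original order does not matter, the sort_index() call can be dropped.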

answered Oct 23 '22 18:10

Scott Boston

I'd suggest using the more_itertools package as a dependency. It handles edge cases such as an uneven partition of the dataframe, and it returns an iterator, which makes things a tiny bit more efficient.

(updated using code from @Acumenus)

from more_itertools import sliced
CHUNK_SIZE = 5

index_slices = sliced(range(len(df)), CHUNK_SIZE)

for index_slice in index_slices:
  chunk = df.iloc[index_slice] # your dataframe chunk ready for use

answered Oct 23 '22 18:10

ilykos

I love @ScottBoston's answer, although I still haven't memorized the incantation. Here's a more verbose function that does the same thing:

def chunkify(df: pd.DataFrame, chunk_size: int):
    start = 0
    length = df.shape[0]

    # If DF is smaller than the chunk, return the DF
    if length <= chunk_size:
        yield df[:]
        return

    # Yield individual chunks
    while start + chunk_size <= length:
        yield df[start:chunk_size + start]
        start = start + chunk_size

    # Yield the remainder chunk, if needed
    if start < length:
        yield df[start:]

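To rebuild the dataframe, accumulate the chunks in a list, then call pd.concat(chunks, axis=0). The chunks are slices of rows, so they must be stacked along axis 0; axis=1 would place them side by side instead.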
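The choice of axis matters when rebuilding from row chunks, and a quick check shows why. In this sketch iter_chunks is a compact, hypothetical stand-in for the generator above so that the example runs on its own; the 5-row dataframe is made up.

```python
import pandas as pd


def iter_chunks(df, chunk_size):
    # Hypothetical compact stand-in for chunkify: yield consecutive row slices.
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"Transaction": [1.0, 2.0, 3.0, 4.0, 5.0]})
chunks = list(iter_chunks(df, 2))  # row counts: 2, 2, 1

# axis=0 stacks the chunks on top of each other and restores the original frame.
rebuilt = pd.concat(chunks, axis=0)

# axis=1 aligns the chunks on the index and puts them side by side,
# giving one column per chunk padded with NaN.
side_by_side = pd.concat(chunks, axis=1)

print(rebuilt.shape)       # (5, 1)
print(side_by_side.shape)  # (5, 3)
```

Since axis=0 is the default, pd.concat(chunks) gives the same result as the first call.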

answered Oct 23 '22 18:10

rodrigo-silveira
