I have a large dataframe (>3MM rows) that I'm trying to pass through a function (the one below is largely simplified), and I keep getting a MemoryError.
I think I'm passing too large of a dataframe into the function, so I'm trying to:
1) Slice the dataframe into smaller chunks (preferably sliced by AcctName)
2) Pass each smaller dataframe through the function
3) Concatenate the dataframes back into one large dataframe (sketched below)
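For concreteness, the end-to-end flow I'm after looks roughly like this (groupby is just one way I imagine the slicing step could work):

import pandas as pd

# Step 1: slice the large dataframe into one chunk per AcctName
# (copies so the function can add a column without touching a slice of large_df)
chunks = [group.copy() for _, group in large_df.groupby('AcctName')]

# Step 2: pass each smaller dataframe through the function
for chunk in chunks:
    trans_times_2(chunk)

# Step 3: concatenate the pieces back into one large dataframe
rejoined_df = pd.concat(chunks)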
def trans_times_2(df):
    df['Double_Transaction'] = df['Transaction'] * 2
large_df

AcctName  Timestamp  Transaction
ABC       12/1       12.12
ABC       12/2       20.89
ABC       12/3       51.93
DEF       12/2       13.12
DEF       12/8       9.93
DEF       12/9       92.09
GHI       12/1       14.33
GHI       12/6       21.99
GHI       12/12      98.81
I know that my function works properly, since it works on a smaller dataframe (e.g., 40,000 rows). I tried the following, but was unsuccessful in concatenating the small dataframes back into one large dataframe.
def split_df(df):
    AcctNames = df.AcctName.unique()
    DataFrameDict = {elem: pd.DataFrame() for elem in AcctNames}
    new_df = []
    for key in DataFrameDict.keys():
        # slice out one account's rows and run them through the function
        DataFrameDict[key] = df[df.AcctName == key]
        trans_times_2(DataFrameDict[key])
        new_df.append(DataFrameDict[key])
    rejoined_df = pd.concat(new_df)
    return rejoined_df
How I envision the dataframes being split:
df1

AcctName  Timestamp  Transaction  Double_Transaction
ABC       12/1       12.12        24.24
ABC       12/2       20.89        41.78
ABC       12/3       51.93        103.86

df2

AcctName  Timestamp  Transaction  Double_Transaction
DEF       12/2       13.12        26.24
DEF       12/8       9.93         19.86
DEF       12/9       92.09        184.18

df3

AcctName  Timestamp  Transaction  Double_Transaction
GHI       12/1       14.33        28.66
GHI       12/6       21.99        43.98
GHI       12/12      98.81        197.62
You can use a list comprehension to split your dataframe into smaller dataframes contained in a list.

n = 200000  # chunk row size
list_df = [df[i:i + n] for i in range(0, df.shape[0], n)]
Or use numpy's array_split (note that here the second argument is the number of chunks, not the chunk size):

list_df = np.array_split(df, n)
You can access the chunks with:
list_df[0]
list_df[1]
etc...
Then you can reassemble them into one dataframe using pd.concat.
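For example, a minimal round trip (my sketch; it assumes df is the large dataframe and numpy/pandas are imported):

import numpy as np
import pandas as pd

list_df = np.array_split(df, 4)   # split into 4 roughly equal row chunks
rebuilt = pd.concat(list_df)      # stack them back in the original order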
By AcctName:

list_df = []
for n, g in df.groupby('AcctName'):
    list_df.append(g)
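For the question's actual use case, the same split can feed trans_times_2 (the asker's function) and be stitched back together; a minimal sketch:

list_df = []
for name, group in df.groupby('AcctName'):
    chunk = group.copy()    # copy so the new column isn't written onto a slice
    trans_times_2(chunk)
    list_df.append(chunk)
rejoined_df = pd.concat(list_df)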
I'd suggest using the more_itertools package. It handles all edge cases, like an uneven partition of the dataframe, and returns an iterator that will make things a tiny bit more efficient.

(Updated using code from @Acumenus.)
from more_itertools import sliced

CHUNK_SIZE = 5
index_slices = sliced(range(len(df)), CHUNK_SIZE)

for index_slice in index_slices:
    chunk = df.iloc[index_slice]  # your dataframe chunk ready for use
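To reassemble afterwards, one option (my addition, not part of the original answer) is to collect the chunks in a list:

from more_itertools import sliced
import pandas as pd

CHUNK_SIZE = 5
chunks = [df.iloc[index_slice] for index_slice in sliced(range(len(df)), CHUNK_SIZE)]
rebuilt = pd.concat(chunks)  # back to one dataframe, row order preserved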
I love @ScottBoston's answer, although I still haven't memorized the incantation. Here's a more verbose function that does the same thing:
def chunkify(df: pd.DataFrame, chunk_size: int):
    start = 0
    length = df.shape[0]

    # If DF is smaller than the chunk, return the DF
    if length <= chunk_size:
        yield df[:]
        return

    # Yield individual chunks
    while start + chunk_size <= length:
        yield df[start:chunk_size + start]
        start = start + chunk_size

    # Yield the remainder chunk, if needed
    if start < length:
        yield df[start:]
To rebuild the data frame, accumulate each chunk in a list, then pd.concat(chunks) (the default axis=0 stacks the row chunks back vertically; axis=1 would wrongly place them side by side).
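For example, a minimal round trip with the question's function (the chunk size is arbitrary, and trans_times_2/large_df are as defined in the question):

chunks = []
for chunk in chunkify(large_df, chunk_size=100_000):
    chunk = chunk.copy()    # copy so the new column isn't written onto a slice
    trans_times_2(chunk)
    chunks.append(chunk)
rebuilt = pd.concat(chunks)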