I need to write a function that splits a given dataframe into chunks of a specified size. For instance, if the dataframe contains 1111 rows, I want to be able to specify a chunk size of 400 rows and get three smaller dataframes with 400, 400 and 311 rows. Is there a convenience function that does this? And what would be the best way to store and iterate over the sliced dataframe?
Example DataFrame
import numpy as np
import pandas as pd
test = pd.concat([pd.Series(np.random.rand(1111)), pd.Series(np.random.rand(1111))], axis = 1)
You can floor-divide a row counter, np.arange(len(test)), by the chunk size and pass the result to groupby. Every run of n consecutive rows gets the same key,
so the dataframe is split into chunks of n rows, with any remainder in the last chunk:
n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
    print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
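To see why this works, it can help to look at the grouping key on its own. The sketch below uses a smaller, made-up example (10 rows, chunk size 4) purely for illustration:

```python
import numpy as np

n = 4                        # illustrative chunk size, not the 400 used above
keys = np.arange(10) // n    # one key per row position
print(keys.tolist())         # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

Rows that share a key end up in the same group, so the last group simply holds whatever rows are left over.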
A more Pythonic way to break a large dataframe into smaller chunks with a fixed number of rows is a list comprehension. The chunks are stored in a plain list that you can iterate over or index:
n = 400  # rows per chunk
list_df = [test[i:i + n] for i in range(0, test.shape[0], n)]
[i.shape for i in list_df]
Output:
[(400, 2), (400, 2), (311, 2)]
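Since the question asks for a function, the slicing idea can be wrapped in a small generator. This is only a sketch: the name split_dataframe and its chunk_size parameter are made up for illustration and are not part of pandas. It uses iloc so that slicing is always by row position, whatever the index looks like.

```python
import numpy as np
import pandas as pd

def split_dataframe(df, chunk_size):
    """Yield consecutive row slices of df with at most chunk_size rows each.

    Hypothetical helper, not a pandas function.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(df), chunk_size):
        # iloc slices by position, so a non-default index is not a problem
        yield df.iloc[start:start + chunk_size]

test = pd.DataFrame(np.random.rand(1111, 2))
chunks = list(split_dataframe(test, 400))
print([c.shape for c in chunks])  # [(400, 2), (400, 2), (311, 2)]
```

Because it is a generator, you can also loop over split_dataframe(test, 400) directly without building the whole list first.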