What is the best/easiest way to split a very large data frame (50GB) into multiple outputs (horizontally)?
I thought about doing something like:
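    stepsize = int(1e8)  # rows per output file
    # step over the row count (len(df)); df.size counts rows * columns
    for id, start in enumerate(range(0, len(df), stepsize)):
        # iloc slices by position and excludes the end, so no row is dropped
        df.iloc[start:start + stepsize].to_csv('/data/bs_' + str(id) + '.csv.out')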
But I bet there is a smarter solution out there?
As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.
Use id in the filename, otherwise it will not work. You missed id, and without id it gives an error.
    import numpy as np

    for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
        df_i.to_csv('/data/bs_{id}.csv'.format(id=id))
This answer brought me to a satisfying solution using:
    numpy.array_split(object, number_of_chunks)

    import numpy as np

    for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
        chunk.to_csv(f'/data/bs_{idx}.csv')
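For anyone wondering how the rows end up distributed: as far as I know, numpy.array_split gives the first (rows % chunks) pieces one extra row each, so the pieces differ in length by at most one. Below is a minimal sketch of that sizing rule in plain Python. The helper name chunk_bounds is my own invention (it is not part of numpy or pandas), and it only computes row positions; it does not touch a data frame.

```python
def chunk_bounds(n_rows, n_chunks):
    """Yield (start, stop) row positions for n_chunks near-equal slices.

    Hypothetical helper, not a numpy/pandas function. It follows the
    sizing rule of numpy.array_split: the first n_rows % n_chunks
    slices each get one extra row.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(n_rows, n_chunks)
    start = 0
    for i in range(n_chunks):
        # the first `extra` slices are one row longer than the rest
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop


# 10 rows split into 3 pieces
print(list(chunk_bounds(10, 3)))  # [(0, 4), (4, 7), (7, 10)]
```

Each (start, stop) pair can then be used as a positional slice, for example df.iloc[start:stop].to_csv(...), which should avoid materializing all the chunks at once.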