What is the best/easiest way to split a very large data frame (50GB) into multiple outputs (horizontally)?
I thought about doing something like:
stepsize = int(1e8)
for id, i in enumerate(range(0, len(df), stepsize)):
    start = i
    end = i + stepsize  # iloc slicing is end-exclusive, so no -1 needed
    df.iloc[start:end].to_csv('/data/bs_' + str(id) + '.csv.out')
But I bet there is a smarter solution out there?
As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.
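For reference, a minimal sketch of that HDF5 route (assuming the PyTables package is installed; the file path and key below are placeholders):

import pandas as pd

df = pd.DataFrame({'a': range(10)})           # stand-in for the 50GB frame
df.to_hdf('/data/bs.h5', key='df', mode='w')  # whole frame in a single HDF5 file
df_back = pd.read_hdf('/data/bs.h5', 'df')    # read it back later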
Make sure id appears in the filename and is passed to format: without the placeholder every chunk is written to the same file, and a {id} placeholder with no matching format argument raises a KeyError.
import numpy as np

for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
    df_i.to_csv('/data/bs_{id}.csv'.format(id=id))
This answer brought me to a satisfying solution using:

numpy.array_split(object, number_of_chunks)

import numpy as np

for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
    chunk.to_csv(f'/data/bs_{idx}.csv')
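Note that, unlike numpy.split, numpy.array_split does not require number_of_chunks to divide the length of the frame evenly; the trailing chunks simply come out one row shorter.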