
What is the fastest way to output large DataFrame into a CSV file?

For Python / pandas I find that df.to_csv(fname) works at a speed of ~1 million rows per minute. I can sometimes improve performance by a factor of 7, like this:

import numpy as np
from pandas import DataFrame

def df2csv(df, fname, myformats=[], sep=','):
  """
    # function is faster than to_csv
    # 7 times faster for numbers if formats are specified,
    # 2 times faster for strings.
    # Note - be careful. It doesn't add quotes and doesn't check
    # for quotes or separators inside elements
    # We've seen output time going down from 45 min to 6 min
    # on a simple numeric 4-col dataframe with 45 million rows.
  """
  if len(df.columns) <= 0:
    return
  Nd = len(df.columns)
  Nd_1 = Nd - 1
  formats = myformats[:]  # take a copy to modify it
  Nf = len(formats)
  # make sure we have formats for all columns
  if Nf < Nd:
    for ii in range(Nf, Nd):
      coltype = df[df.columns[ii]].dtype
      ff = '%s'
      if coltype == np.int64:
        ff = '%d'
      elif coltype == np.float64:
        ff = '%f'
      formats.append(ff)
  fh = open(fname, 'w')
  fh.write(','.join(df.columns) + '\n')
  for row in df.itertuples(index=False):
    ss = ''
    for ii in range(Nd):
      ss += formats[ii] % row[ii]
      if ii < Nd_1:
        ss += sep
    fh.write(ss + '\n')
  fh.close()

aa = DataFrame({'A': range(1000000)})
aa['B'] = aa.A + 1.0
aa['C'] = aa.A + 2.0
aa['D'] = aa.A + 3.0

timeit -r1 -n1 aa.to_csv('junk1')                                            # 52.9 sec
timeit -r1 -n1 df2csv(aa,'junk3',myformats=['%d','%.1f','%.1f','%.1f'])      #  7.5 sec
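(Side note, not from the original post: df2csv deliberately gives up quoting. If quote handling matters, the same row-by-row idea can go through the standard csv module instead, which handles quotes and embedded separators. A minimal sketch under a hypothetical name, df2csv_quoted; expect it to be slower than raw %-formatting but the approach is otherwise the same.)

import csv

def df2csv_quoted(df, fname, sep=','):
  # Hypothetical variant of df2csv: the csv module handles quoting
  # and separators inside elements, which df2csv does not.
  with open(fname, 'w', newline='') as fh:
    writer = csv.writer(fh, delimiter=sep, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(df.columns)                   # header row
    writer.writerows(df.itertuples(index=False))  # data rows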

Note: the increase in performance depends on the dtypes. But it was always true (at least in my tests) that to_csv() performs much more slowly than non-optimized Python.

If I have a CSV file with 45 million rows, then:

aa = read_csv(infile)   #  1.5 min
aa.to_csv(outfile)      # 45 min
df2csv(aa,...)          # ~6 min
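(Another hedged option, not benchmarked in the post: for an all-numeric frame, numpy.savetxt can do the formatting and writing in one call. A sketch, reusing outfile and the aa frame from above and assuming every column can be printed as a float:)

import numpy as np

# Sketch: numpy formats and writes the whole numeric block itself.
# fmt applies to every column; comments='' stops numpy from
# prefixing the header line with '# '.
np.savetxt(outfile, aa.values, fmt='%.1f', delimiter=',',
           header=','.join(aa.columns), comments='')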

Questions:

What are the ways to make the output even faster? What's wrong with to_csv()? Why is it so slow?

Note: my tests were done using pandas 0.9.1 on a local drive on a Linux server.

asked Mar 14 '13 by Lev Selector


1 Answer

Lev, pandas has rewritten to_csv to make a big improvement in native speed. The process is now i/o bound, accounts for many subtle dtype issues, and handles quote cases. Here are our performance results vs. 0.10.1 (for the upcoming 0.11 release). Times are in ms; a lower ratio is better.

Results:
                                        t_head  t_baseline   ratio
name
frame_to_csv2 (100k rows)             190.5260   2244.4260  0.0849
write_csv_standard (10k rows)          38.1940    234.2570  0.1630
frame_to_csv_mixed (10k rows, mixed)  369.0670   1123.0412  0.3286
frame_to_csv (3k rows, wide)          112.2720    226.7549  0.4951

So throughput for a single dtype (e.g. floats) that is not too wide is about 20M rows / min. Here is your example from above:

In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')})

In [13]: df['B'] = df['A'] + 1.0

In [14]: df['C'] = df['A'] + 2.0

In [15]: df['D'] = df['A'] + 2.0

In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
1 loops, best of 1: 119 s per loop
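(A usage note, not from Jeff's answer: to_csv also accepts a chunksize argument, the number of rows written per batch. As far as I know this mainly smooths memory use on frames this large rather than improving raw speed.)

# Not from the answer: write in batches of 1 million rows.
df.to_csv('test.csv', chunksize=1000000)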
answered Sep 21 '22 by Jeff