
What is the fastest way to output large DataFrame into a CSV file?

For Python / pandas I find that df.to_csv(fname) works at a speed of ~1 million rows per minute. I can sometimes improve performance by a factor of 7, like this:

import numpy as np
from pandas import DataFrame

def df2csv(df, fname, myformats=[], sep=','):
  """
    # function is faster than to_csv
    # 7 times faster for numbers if formats are specified,
    # 2 times faster for strings.
    # Note - be careful. It doesn't add quotes and doesn't check
    # for quotes or separators inside elements
    # We've seen output time going down from 45 min to 6 min
    # on a simple numeric 4-col dataframe with 45 million rows.
  """
  if len(df.columns) <= 0:
    return
  Nd = len(df.columns)
  Nd_1 = Nd - 1
  formats = myformats[:]  # take a copy to modify it
  Nf = len(formats)
  # make sure we have formats for all columns
  if Nf < Nd:
    for ii in range(Nf, Nd):
      coltype = df[df.columns[ii]].dtype
      ff = '%s'
      if coltype == np.int64:
        ff = '%d'
      elif coltype == np.float64:
        ff = '%f'
      formats.append(ff)
  fh = open(fname, 'w')
  fh.write(','.join(df.columns) + '\n')
  for row in df.itertuples(index=False):
    ss = ''
    for ii in range(Nd):
      ss += formats[ii] % row[ii]
      if ii < Nd_1:
        ss += sep
    fh.write(ss + '\n')
  fh.close()

aa = DataFrame({'A': range(1000000)})
aa['B'] = aa.A + 1.0
aa['C'] = aa.A + 2.0
aa['D'] = aa.A + 3.0

timeit -r1 -n1 aa.to_csv('junk1')                                            # 52.9 sec
timeit -r1 -n1 df2csv(aa,'junk3',myformats=['%d','%.1f','%.1f','%.1f'])      #  7.5 sec
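(Side note, not from the original post: df2csv deliberately gives up quoting. If quote handling matters, the same row-by-row idea can go through the standard csv module instead, which handles quotes and embedded separators. A minimal sketch under a hypothetical name, df2csv_quoted; expect it to be slower than raw %-formatting but the approach is otherwise the same.)

import csv

def df2csv_quoted(df, fname, sep=','):
  # Hypothetical variant of df2csv: the csv module handles quoting
  # and separators inside elements, which df2csv does not.
  with open(fname, 'w', newline='') as fh:
    writer = csv.writer(fh, delimiter=sep, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(df.columns)                   # header row
    writer.writerows(df.itertuples(index=False))  # data rows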

Note: the increase in performance depends on the dtypes. But it was always true (at least in my tests) that to_csv() performs much more slowly than non-optimized Python.

If I have a CSV file with 45 million rows, then:

aa = read_csv(infile)   #  1.5 min
aa.to_csv(outfile)      # 45 min
df2csv(aa,...)          # ~6 min
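(Another hedged option, not benchmarked in the post: for an all-numeric frame, numpy.savetxt can do the formatting and writing in one call. A sketch, reusing outfile and the aa frame from above and assuming every column can be printed as a float:)

import numpy as np

# Sketch: numpy formats and writes the whole numeric block itself.
# fmt applies to every column; comments='' stops numpy from
# prefixing the header line with '# '.
np.savetxt(outfile, aa.values, fmt='%.1f', delimiter=',',
           header=','.join(aa.columns), comments='')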

Questions:

What are the ways to make the output even faster? What's wrong with to_csv()? Why is it so slow?

Note: my tests were done using pandas 0.9.1 on a local drive on a Linux server.

asked Mar 14 '13 by Lev Selector


1 Answer

Lev, pandas has rewritten to_csv to make a big improvement in native speed. The process is now i/o bound, accounts for many subtle dtype issues, and handles quote cases. Here are our performance results vs. 0.10.1 (for the upcoming 0.11 release). Times are in ms; a lower ratio is better.

Results:
                                        t_head  t_baseline   ratio
name
frame_to_csv2 (100k rows)             190.5260   2244.4260  0.0849
write_csv_standard (10k rows)          38.1940    234.2570  0.1630
frame_to_csv_mixed (10k rows, mixed)  369.0670   1123.0412  0.3286
frame_to_csv (3k rows, wide)          112.2720    226.7549  0.4951

So throughput for a single dtype (e.g. floats) that is not too wide is about 20M rows / min. Here is your example from above:

In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')})

In [13]: df['B'] = df['A'] + 1.0

In [14]: df['C'] = df['A'] + 2.0

In [15]: df['D'] = df['A'] + 2.0

In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
1 loops, best of 1: 119 s per loop
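(A usage note, not from Jeff's answer: to_csv also accepts a chunksize argument, the number of rows written per batch. As far as I know this mainly smooths memory use on frames this large rather than improving raw speed.)

# Not from the answer: write in batches of 1 million rows.
df.to_csv('test.csv', chunksize=1000000)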
answered Sep 21 '22 by Jeff