 

What is the fastest way to load a big CSV file in a notebook to work with Python pandas?

I'm trying to load a CSV file which is about 250 MB: roughly 4 million rows and 6 columns of time-series data (1-minute frequency). The usual procedure is:

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
    df = pd.read_csv(location)

This procedure takes about 20 minutes! Very preliminarily, I have explored the following options:

  • Read the file in chunks and then put the chunks together (see the sketch just after this list).
  • HDF5
  • 'feather'
  • 'pickle'
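
A minimal sketch of the chunked variant (the path and chunk size here are placeholders, not tuned values):

    import pandas as pd

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'  # placeholder path

    # read the CSV in pieces, then glue the pieces back into one DataFrame
    chunks = pd.read_csv(location, chunksize=500000)
    df = pd.concat(chunks, ignore_index=True)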

I wonder if anybody has compared these options (or more) and whether there's a clear winner. If nobody answers, I will post my own results in the future; I just don't have time right now.

asked May 03 '16 by hernanavella



1 Answer

Here are the results of my read/write comparison for the DF (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison for the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (various compression settings):

                      read_s  write_s  size_ratio_to_CSV
    storage
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011
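
The read_s/write_s columns come from the %timeit runs listed under "RAW DATA" below. To collect comparable numbers outside IPython, a minimal sketch (the file name is a placeholder; df is the frame described under "DF info") could look like this:

    import time
    import pandas as pd

    def timed(func, *args, **kwargs):
        """Call func once and return the elapsed wall-clock time in seconds."""
        start = time.perf_counter()
        func(*args, **kwargs)
        return time.perf_counter() - start

    write_s = timed(df.to_pickle, 'df.pickle')   # time a single write
    read_s = timed(pd.read_pickle, 'df.pickle')  # time a single read
    print(f'Pickle: read {read_s:.3f} s, write {write_s:.3f} s')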

reading

[bar chart omitted]

writing/saving

[bar chart omitted]

file size ratio in relation to the uncompressed CSV file

[bar chart omitted]

RAW DATA:

CSV:

    In [68]: %timeit df.to_csv(fcsv)
    1 loop, best of 3: 1min 9s per loop

    In [74]: %timeit pd.read_csv(fcsv)
    1 loop, best of 3: 17.9 s per loop

CSV.gzip:

    In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
    1 loop, best of 3: 3min 6s per loop

    In [75]: %timeit pd.read_csv(fcsv_gz)
    1 loop, best of 3: 18.9 s per loop

Pickle:

    In [66]: %timeit df.to_pickle(fpckl)
    1 loop, best of 3: 1.77 s per loop

    In [72]: %timeit pd.read_pickle(fpckl)
    10 loops, best of 3: 173 ms per loop

HDF (format='fixed') [Default]:

    In [67]: %timeit df.to_hdf(fh5, 'df')
    1 loop, best of 3: 2.03 s per loop

    In [73]: %timeit pd.read_hdf(fh5, 'df')
    10 loops, best of 3: 196 ms per loop

HDF (format='table'):

    In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
    1 loop, best of 3: 2.6 s per loop

    In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
    1 loop, best of 3: 230 ms per loop

HDF (format='table', complib='zlib', complevel=5):

    In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
    1 loop, best of 3: 5.44 s per loop

    In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
    1 loop, best of 3: 854 ms per loop

HDF (format='table', complib='zlib', complevel=9):

    In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
    1 loop, best of 3: 5.95 s per loop

    In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
    1 loop, best of 3: 860 ms per loop

HDF (format='table', complib='bzip2', complevel=5):

    In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop

    In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
    1 loop, best of 3: 2.5 s per loop

HDF (format='table', complib='bzip2', complevel=9):

    In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop

    In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
    1 loop, best of 3: 2.5 s per loop

PS: I can't test feather on my Windows notebook.

DF info:

    In [49]: df.shape
    Out[49]: (4000000, 6)

    In [50]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4000000 entries, 0 to 3999999
    Data columns (total 6 columns):
    a    datetime64[ns]
    b    datetime64[ns]
    c    datetime64[ns]
    d    datetime64[ns]
    e    datetime64[ns]
    f    datetime64[ns]
    dtypes: datetime64[ns](6)
    memory usage: 183.1 MB

    In [41]: df.head()
    Out[41]:
                        a                   b                   c  \
    0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
    1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
    2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
    3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
                        d                   e                   f
    0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
    2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
    3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
    4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
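
A frame with the same shape and dtypes can be generated along these lines (a sketch; the one-minute column offsets just mimic the head() output above):

    import pandas as pd

    # six datetime64[ns] columns of 1-minute timestamps, shifted by one minute per column
    base = pd.Series(pd.date_range('1970-01-01', periods=4000000, freq='min'))
    df = pd.DataFrame({col: base + pd.Timedelta(minutes=i)
                       for i, col in enumerate('abcdef')})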

File sizes:

    { .data }  » ls -lh 37010212.*                                       /d/temp/.data
    -rw-r--r-- 1 Max None 492M May  3 22:21 37010212.csv
    -rw-r--r-- 1 Max None  23M May  3 22:19 37010212.csv.gz
    -rw-r--r-- 1 Max None 214M May  3 22:02 37010212.h5
    -rw-r--r-- 1 Max None 184M May  3 22:02 37010212.pickle
    -rw-r--r-- 1 Max None 215M May  4 10:39 37010212_tab.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:46 37010212_tab_compress_bzip2_l5.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:51 37010212_tab_compress_bzip2_l9.h5
    -rw-r--r-- 1 Max None  17M May  4 10:42 37010212_tab_compress_zlib5.h5
    -rw-r--r-- 1 Max None  17M May  4 10:36 37010212_tab_compress_zlib9.h5

Conclusion:

Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames in one file, you can read your data conditionally (look at the where parameter in read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.
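
For example, a conditional read requires the 'table' format and an indexed data column; a sketch (the file name, key and condition below are made up for illustration):

    # 'table' format is queryable; data_columns makes column 'a' usable in 'where'
    df.to_hdf('data.h5', key='df', format='table',
              complevel=5, complib='zlib', data_columns=['a'])

    # several frames can live in one file under different keys,
    # and 'where' pulls back only the matching rows
    subset = pd.read_hdf('data.h5', key='df', where="a >= '1970-01-02'")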

PS: if you can build/use the feather format, it should be even faster than HDF5 and Pickle.
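
For completeness, the feather round-trip itself is just a pair of calls (assuming pyarrow or the feather-format package is installed; the file name is a placeholder):

    # note: feather expects a plain RangeIndex, so reset any custom index first
    df.to_feather('df.feather')
    df2 = pd.read_feather('df.feather')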

PPS: don't use Pickle for very big data frames, as you may end up with a SystemError: error return without exception set error message. It's also described here and here.

answered Oct 04 '22 by MaxU - stop WAR against UA