 

What is the fastest way to load a big CSV file in a notebook to work with Python pandas?

I'm trying to load a CSV file which is about 250 MB: roughly 4 million rows and 6 columns of time-series data (1-minute frequency). The usual procedure is:

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
    df = pd.read_csv(location)

This procedure takes about 20 minutes! Very preliminarily, I have explored the following options:

  • Read the file in chunks and then put the chunks together (see the sketch just after this list).
  • HDF5
  • 'feather'
  • 'pickle'
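
A minimal sketch of the chunked variant (the path and chunk size here are placeholders, not tuned values):

    import pandas as pd

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'  # placeholder path

    # read the CSV in pieces, then glue the pieces back into one DataFrame
    chunks = pd.read_csv(location, chunksize=500000)
    df = pd.concat(chunks, ignore_index=True)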

I wonder if anybody has compared these options (or more) and whether there's a clear winner. If nobody answers, I will post my own results in the future; I just don't have time right now.

asked May 03 '16 by hernanavella



1 Answer

Here are the results of my read/write comparison for the DF (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).

Comparison for the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (various compression settings):

                      read_s  write_s  size_ratio_to_CSV
    storage
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011
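
The read_s/write_s columns come from the %timeit runs listed under "RAW DATA" below. To collect comparable numbers outside IPython, a minimal sketch (the file name is a placeholder; df is the frame described under "DF info") could look like this:

    import time
    import pandas as pd

    def timed(func, *args, **kwargs):
        """Call func once and return the elapsed wall-clock time in seconds."""
        start = time.perf_counter()
        func(*args, **kwargs)
        return time.perf_counter() - start

    write_s = timed(df.to_pickle, 'df.pickle')   # time a single write
    read_s = timed(pd.read_pickle, 'df.pickle')  # time a single read
    print(f'Pickle: read {read_s:.3f} s, write {write_s:.3f} s')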

reading

[bar chart omitted]

writing/saving

[bar chart omitted]

file size ratio in relation to the uncompressed CSV file

[bar chart omitted]

RAW DATA:

CSV:

    In [68]: %timeit df.to_csv(fcsv)
    1 loop, best of 3: 1min 9s per loop

    In [74]: %timeit pd.read_csv(fcsv)
    1 loop, best of 3: 17.9 s per loop

CSV.gzip:

    In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
    1 loop, best of 3: 3min 6s per loop

    In [75]: %timeit pd.read_csv(fcsv_gz)
    1 loop, best of 3: 18.9 s per loop

Pickle:

    In [66]: %timeit df.to_pickle(fpckl)
    1 loop, best of 3: 1.77 s per loop

    In [72]: %timeit pd.read_pickle(fpckl)
    10 loops, best of 3: 173 ms per loop

HDF (format='fixed') [Default]:

    In [67]: %timeit df.to_hdf(fh5, 'df')
    1 loop, best of 3: 2.03 s per loop

    In [73]: %timeit pd.read_hdf(fh5, 'df')
    10 loops, best of 3: 196 ms per loop

HDF (format='table'):

    In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
    1 loop, best of 3: 2.6 s per loop

    In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
    1 loop, best of 3: 230 ms per loop

HDF (format='table', complib='zlib', complevel=5):

    In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
    1 loop, best of 3: 5.44 s per loop

    In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
    1 loop, best of 3: 854 ms per loop

HDF (format='table', complib='zlib', complevel=9):

    In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
    1 loop, best of 3: 5.95 s per loop

    In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
    1 loop, best of 3: 860 ms per loop

HDF (format='table', complib='bzip2', complevel=5):

    In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop

    In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
    1 loop, best of 3: 2.5 s per loop

HDF (format='table', complib='bzip2', complevel=9):

    In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
    1 loop, best of 3: 36.5 s per loop

    In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
    1 loop, best of 3: 2.5 s per loop

PS: I can't test feather on my Windows notebook.

DF info:

    In [49]: df.shape
    Out[49]: (4000000, 6)

    In [50]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4000000 entries, 0 to 3999999
    Data columns (total 6 columns):
    a    datetime64[ns]
    b    datetime64[ns]
    c    datetime64[ns]
    d    datetime64[ns]
    e    datetime64[ns]
    f    datetime64[ns]
    dtypes: datetime64[ns](6)
    memory usage: 183.1 MB

    In [41]: df.head()
    Out[41]:
                        a                   b                   c  \
    0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
    1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
    2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
    3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
                        d                   e                   f
    0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
    1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
    2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
    3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
    4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
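
A frame with the same shape and dtypes can be generated along these lines (a sketch; the one-minute column offsets just mimic the head() output above):

    import pandas as pd

    # six datetime64[ns] columns of 1-minute timestamps, shifted by one minute per column
    base = pd.Series(pd.date_range('1970-01-01', periods=4000000, freq='min'))
    df = pd.DataFrame({col: base + pd.Timedelta(minutes=i)
                       for i, col in enumerate('abcdef')})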

File sizes:

    { .data }  » ls -lh 37010212.*                                       /d/temp/.data
    -rw-r--r-- 1 Max None 492M May  3 22:21 37010212.csv
    -rw-r--r-- 1 Max None  23M May  3 22:19 37010212.csv.gz
    -rw-r--r-- 1 Max None 214M May  3 22:02 37010212.h5
    -rw-r--r-- 1 Max None 184M May  3 22:02 37010212.pickle
    -rw-r--r-- 1 Max None 215M May  4 10:39 37010212_tab.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:46 37010212_tab_compress_bzip2_l5.h5
    -rw-r--r-- 1 Max None 5.4M May  4 10:51 37010212_tab_compress_bzip2_l9.h5
    -rw-r--r-- 1 Max None  17M May  4 10:42 37010212_tab_compress_zlib5.h5
    -rw-r--r-- 1 Max None  17M May  4 10:36 37010212_tab_compress_zlib9.h5

Conclusion:

Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames in one file, you can read your data conditionally (look at the where parameter in read_hdf()), and you can also store your data compressed (zlib is faster, bzip2 provides a better compression ratio), etc.
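
For example, a conditional read requires the 'table' format and an indexed data column; a sketch (the file name, key and condition below are made up for illustration):

    # 'table' format is queryable; data_columns makes column 'a' usable in 'where'
    df.to_hdf('data.h5', key='df', format='table',
              complevel=5, complib='zlib', data_columns=['a'])

    # several frames can live in one file under different keys,
    # and 'where' pulls back only the matching rows
    subset = pd.read_hdf('data.h5', key='df', where="a >= '1970-01-02'")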

PS: if you can build/use the feather format, it should be even faster than HDF5 and Pickle.
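
For completeness, the feather round-trip itself is just a pair of calls (assuming pyarrow or the feather-format package is installed; the file name is a placeholder):

    # note: feather expects a plain RangeIndex, so reset any custom index first
    df.to_feather('df.feather')
    df2 = pd.read_feather('df.feather')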

PPS: don't use Pickle for very big data frames, as you may end up with a SystemError: error return without exception set error message. It's also described here and here.

answered Oct 04 '22 by MaxU - stop WAR against UA