I am reading a large CSV with around 10 million rows and 20 columns (with header names).
The columns hold numeric values, two columns of dates, and some strings.
Currently it takes me around 1.5 minutes to load the data with something like this:
df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])
How can I make this significantly faster while still ending up with the same DataFrame after the read?
I tried using an HDF5 store instead, but it was just as slow.
Here is a subset of the data I am trying to read (8 of the actual 20 columns, and 3 rows out of a couple million):
Date Comp Rating Price Estprice Dividend? Date_earnings Returns
3/12/2017 Apple Buy 100 114 Yes 4/4/2017 0.005646835
3/12/2017 Blackberry Sell 120 97 No 4/25/2017 0.000775331
3/12/2017 Microsoft Hold 140 100 Yes 5/28/2017 0.003028423
Thanks for the advice.
Measured purely by CPU time, fastparquet is by far the fastest. Whether it gives you an elapsed-time improvement will depend on whether you already have parallelism, on your particular machine, and so on; different CSV files will also have different parsing costs. This is just one example.
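A minimal sketch of the convert-once, read-many workflow, assuming fastparquet is installed and reusing the file and column names from the question:

import pandas as pd

# One-time conversion: pay the CSV parsing cost once up front.
df = pd.read_csv('data.csv', parse_dates=['date'])
df.to_parquet('data.parquet', engine='fastparquet')

# Every later load skips text parsing entirely; dtypes (including
# the datetime column) are stored in the file itself.
df = pd.read_parquet('data.parquet', engine='fastparquet').set_index('date')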
A common approach I take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas is to save the DataFrames to .pkl files for future work. They do take up more disk space (sometimes as much as 2x), but they reduce my load times in Jupyter Notebook from 10-50 seconds with CSV to about 1-5 seconds with pickle.
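A minimal sketch of that workflow, using the test filenames from the timings below (pickle stores the DataFrame's dtypes along with its data, so nothing is re-parsed on load):

import pandas as pd

# Pay the CSV parsing cost once, then persist the parsed DataFrame.
df = pd.read_csv('./testFile.csv')
df.to_pickle('./testFile.pkl')

# Later sessions reload the binary pickle instead of re-parsing text.
df = pd.read_pickle('./testFile.pkl')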
In [1]: %%time
   ...: dfTest = pd.read_pickle('./testFile.pkl')
   ...: print(dfTest.shape)
(10820089, 23)
Wall time: 1.89 s

In [2]: %%time
   ...: dfTest = pd.read_csv('./testFile.csv')
   ...: print(dfTest.shape)
(10820089, 23)
Wall time: 18.9 s
Extra tip: after I'm done performing operations on the dataset, I usually output the DataFrame back to a CSV for smaller archiving of my projects.
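For example (the archive filename here is just an illustration):

# Final results go back to CSV: smaller on disk and readable anywhere.
df.to_csv('./project_archive.csv', index=False)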