
Pandas read_csv speed up

I am reading a large csv which has around 10 million rows and 20 different columns (with header names).

The columns contain numeric values, two date columns, and some strings.

Currently it takes me around 1.5 minutes to load the data with something like this:

df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])

How can I make this significantly faster while still ending up with the same dataframe once the data is read?

I tried using an HDF5 database, but it was just as slow.

Here is a subset of the data I am trying to read (8 of the 20 columns and 3 rows out of a couple million):

Date        Comp        Rating  Price  Estprice  Dividend?  Date_earnings  Returns
3/12/2017   Apple       Buy     100    114       Yes        4/4/2017       0.005646835
3/12/2017   Blackberry  Sell    120    97        No         4/25/2017      0.000775331
3/12/2017   Microsoft   Hold    140    100       Yes        5/28/2017      0.003028423
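
For reference, a minimal sketch of how the same load might be spelled out with explicit dtypes and date columns, which sometimes speeds up parsing (the column names come from the sample above; the dtypes are assumptions, not from the original post):

import pandas as pd

# Assumed dtypes based on the sample rows; repeated strings are read as categories.
dtypes = {
    'Comp': 'category',
    'Rating': 'category',
    'Price': 'float64',
    'Estprice': 'float64',
    'Dividend?': 'category',
    'Returns': 'float64',
}

df = pd.read_csv(
    'data.csv',
    dtype=dtypes,
    parse_dates=['Date', 'Date_earnings'],  # parse both date columns while reading
    index_col='Date',
)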

Thanks for the advice.

asked Mar 12 '17 by MysterioProgrammer91

People also ask

How do I read a CSV file faster?

Measured purely by CPU, fastparquet is by far the fastest. Whether it gives you an elapsed time improvement will depend on whether you have existing parallelism or not, your particular computer, and so on. And different CSV files will presumably have different parsing costs; this is just one example.
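
As a hedged illustration of that route, a minimal parquet round trip in pandas (assuming pyarrow or fastparquet is installed; file names are placeholders):

import pandas as pd

# One-time conversion: parse the CSV once, then cache it in a columnar format.
df = pd.read_csv('data.csv')
df.to_parquet('data.parquet')  # engine='auto' picks pyarrow or fastparquet

# Later sessions load the parquet file instead of re-parsing the CSV.
df = pd.read_parquet('data.parquet')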

Is read_csv faster than read_excel?

Python loads CSV files 100 times faster than Excel files. Use CSVs.

Is pandas faster than CSV writer?

Some tools claim to read and write CSV datasets about 7 times faster than pandas. Pandas itself is slow when it comes to reading and saving data files, and that is a huge time waster, especially if your datasets measure gigabytes in size.


1 Answer

A common approach I take when handling large datasets (~4-10 million rows, 15-30 columns) in pandas is to save the dataframes as .pkl files for future operations. They do take up more space on disk (sometimes as much as 2x the file size), but they cut my load times in Jupyter Notebook from 10-50 seconds with csv to about 1-5 seconds with pkl.

In [1]: %%time
        dfTest = pd.read_pickle('./testFile.pkl')
        print(dfTest.shape)
Out[1]: (10820089, 23)
        Wall time: 1.89 s

In [2]: %%time
        dfTest = pd.read_csv('./testFile.csv')
        print(dfTest.shape)
Out[2]: (10820089, 23)
        Wall time: 18.9 s

See the test file size differences used in this test here.

Extra tip: After I'm done performing operations on a dataset, I usually write the dataframe back out to csv, which keeps the archived project smaller.
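
Putting that workflow together, a minimal sketch of the one-time conversion, the later reloads, and the final csv export (file names are placeholders, not from the answer):

import pandas as pd

# One-time: pay the slow CSV parse once, then cache the dataframe as a pickle.
df = pd.read_csv('./testFile.csv')
df.to_pickle('./testFile.pkl')

# Every later session: reload the pickled dataframe quickly, dtypes intact.
df = pd.read_pickle('./testFile.pkl')

# When the project is finished, write the result back out as csv for archiving.
df.to_csv('./archivedResult.csv', index=False)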

answered Oct 11 '22 by Ryan Oz