I am reading a large CSV with around 10 million rows and 20 columns (with header names).
The columns hold numeric values, two columns of dates, and some strings.
Currently it takes me around 1.5 minutes to load the data with something like this:
df = pd.read_csv('data.csv', index_col='date', parse_dates=['date'])
How can I make this significantly faster while still ending up with the same DataFrame after the read?
I tried using an HDF5 store instead, but it was just as slow.
Here is a subset of the data I am trying to read (8 of the actual 20 columns, and 3 rows out of a couple million):
Date Comp Rating Price Estprice Dividend? Date_earnings Returns
3/12/2017 Apple Buy 100 114 Yes 4/4/2017 0.005646835
3/12/2017 Blackberry Sell 120 97 No 4/25/2017 0.000775331
3/12/2017 Microsoft Hold 140 100 Yes 5/28/2017 0.003028423
Thanks for the advice.
Measured purely by CPU time, fastparquet is by far the fastest. Whether it gives you an elapsed-time improvement will depend on whether you already have parallelism, on your particular machine, and so on; different CSV files will also have different parsing costs. This is just one example.
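A minimal sketch of the convert-once, read-many workflow, assuming fastparquet is installed and reusing the file and column names from the question:

import pandas as pd

# One-time conversion: pay the CSV parsing cost once up front.
df = pd.read_csv('data.csv', parse_dates=['date'])
df.to_parquet('data.parquet', engine='fastparquet')

# Every later load skips text parsing entirely; dtypes (including
# the datetime column) are stored in the file itself.
df = pd.read_parquet('data.parquet', engine='fastparquet').set_index('date')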
A common approach I take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas is to save the DataFrames to .pkl files for future work. They do take up more disk space (sometimes as much as 2x), but they reduce my load times in Jupyter Notebook from 10-50 seconds with CSV to about 1-5 seconds with pickle.
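A minimal sketch of that workflow, using the test filenames from the timings below (pickle stores the DataFrame's dtypes along with its data, so nothing is re-parsed on load):

import pandas as pd

# Pay the CSV parsing cost once, then persist the parsed DataFrame.
df = pd.read_csv('./testFile.csv')
df.to_pickle('./testFile.pkl')

# Later sessions reload the binary pickle instead of re-parsing text.
df = pd.read_pickle('./testFile.pkl')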
In [1]: %%time
   ...: dfTest = pd.read_pickle('./testFile.pkl')
   ...: print(dfTest.shape)
(10820089, 23)
Wall time: 1.89 s

In [2]: %%time
   ...: dfTest = pd.read_csv('./testFile.csv')
   ...: print(dfTest.shape)
(10820089, 23)
Wall time: 18.9 s
Extra tip: after I'm done performing operations on the dataset, I usually output the DataFrame back to a CSV for smaller archiving of my projects.
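For example (the archive filename here is just an illustration):

# Final results go back to CSV: smaller on disk and readable anywhere.
df.to_csv('./project_archive.csv', index=False)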