Does indexing make slicing a pandas DataFrame faster?

Tags:

python

pandas

I have a pandas DataFrame holding more than a million records. One of its columns is a datetime. A sample of my data looks like this:

time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...

I need to efficiently get the records that fall within a specific time period. The following naive approach is very time-consuming.

new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]

I know that in a DBMS like MySQL, indexing on the time field is an effective way to speed up retrieving records for a given time period.

My questions are:

1. Does setting an index in pandas, such as df.index = df.time, make slicing faster?
2. If the answer to Q1 is 'No', what is the common, efficient way to get the records within a specific time period in pandas?

asked Dec 02 '15 03:12

Light Yagmi

1 Answer

Let's create a DataFrame with 1 million rows and time several ways of selecting a date range. The index is a pandas DatetimeIndex.

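import numpy as np
import pandas as pd

# 1,000,000 rows of random data, one row every 10 seconds starting 2015-01-01
df = pd.DataFrame(np.random.randn(1000000, 3),
                  columns=list('ABC'),
                  index=pd.date_range(start='2015-1-1', freq='10s', periods=1000000))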

Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):

%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)

%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)

%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)

%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)

Here are the timings with the Datetime index as a column:

df.reset_index(inplace=True)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)
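
The %timeit lines above need IPython. As a rough sketch for rerunning the comparison as a plain script on a current pandas version, the standard-library timeit module can be used instead. The row count is reduced and the date range shortened so it runs quickly; the function names are made up for this example, and `.ix` is left out because it was removed in pandas 1.0. No timings are shown here since they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # smaller than the 1,000,000 rows above so the script runs quickly
df = pd.DataFrame(np.random.randn(n, 3),
                  columns=list("ABC"),
                  index=pd.date_range(start="2015-1-1", freq="10s", periods=n))
df_col = df.reset_index()  # the datetime values become a column named "index"


def slice_index():
    # Label-based slice on the sorted DatetimeIndex
    return df.loc["2015-1-2":"2015-1-5"]


def mask_index():
    # Boolean mask computed from the index
    return df.loc[(df.index >= "2015-1-2") & (df.index <= "2015-1-5")]


def mask_column():
    # Boolean mask computed from an ordinary datetime column
    return df_col.loc[(df_col["index"] >= "2015-1-2") & (df_col["index"] <= "2015-1-5")]


for fn in (slice_index, mask_index, mask_column):
    # Best of 3 repeats, 100 calls each, reported per call
    seconds = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```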

All of the above indexing techniques produce the same dataframe:

>>> df2.shape
(250560, 3)
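
One detail worth checking on your own pandas version: on recent releases, a string slice such as '2015-2-1':'2015-3-1' treats the end label as the whole day, while a comparison such as df.index <= '2015-3-1' parses the string as midnight. The sketch below, using a small made-up frame named `small`, compares the two on the end boundary. It reflects current pandas behaviour and may not match the versions timed above.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame: one row every 6 hours for 100 days
idx = pd.date_range("2015-01-01", periods=400, freq=pd.Timedelta(hours=6))
small = pd.DataFrame(np.arange(400), columns=["A"], index=idx)

# Partial-string slice: the end label covers all of 2015-03-01
by_slice = small.loc["2015-02-01":"2015-03-01"]

# Boolean mask: the end string is parsed as 2015-03-01 00:00:00
by_mask = small.loc[(small.index >= "2015-02-01") & (small.index <= "2015-03-01")]

print(len(by_slice), by_slice.index.max())
print(len(by_mask), by_mask.index.max())
```

If the two must match exactly, spelling out full timestamps for both endpoints removes the ambiguity.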

It appears that either of the first two methods is the best choice in this situation, and the fourth method works just as well with the latest version of pandas.
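
Applied to data shaped like the question's, this could look like the sketch below. The first two rows are the sample from the question; the other rows and the start_time and end_time values are made up for illustration. Note that a label slice includes both endpoints, unlike the strict > and < in the question's mask.

```python
import io

import pandas as pd

# First two rows come from the question; the rest are hypothetical.
csv_text = """time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:06,113,224,335
2015-05-01 10:00:09,114,225,336
"""

# Parse the time column as datetimes, make it the index, and keep it sorted
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])
df = df.set_index("time").sort_index()

# Hypothetical period of interest
start_time = pd.Timestamp("2015-05-01 10:00:01")
end_time = pd.Timestamp("2015-05-01 10:00:07")

# Label-based slice on the DatetimeIndex (both endpoints inclusive)
new_df = df.loc[start_time:end_time]
print(new_df)
```

Sorting the index once up front is a deliberate choice here: range slicing by label expects a monotonic index.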

answered Sep 18 '22 17:09

Alexander
