Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Indexing makes Slice of pandas dataframe faster?

Tags:

python

pandas

I have a pandas dataframe holding more than million records. One of its columns is datetime. The sample of my data is like the following:

time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...

I need to effectively get the record during the specific period. The following naive way is very time consuming.

new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]

I know that on DBMS like MySQL the indexing by the time field is effective for getting records by specifying the time period.

My question is

  1. Does the indexing of pandas such as df.index = df.time makes the slicing process faster?
  2. If the answer of Q1 is 'No', what is the common effective way to get a record during the specific time period in pandas?
like image 619
Light Yagmi Avatar asked Dec 02 '15 03:12

Light Yagmi


People also ask

What is the purpose of index attribute in data frame?

Definition and Usage The index property returns the index information of the DataFrame. The index information contains the labels of the rows. If the rows has NOT named indexes, the index property returns a RangeIndex object with the start, stop, and step values.

Why is Itertuples faster than Iterrows?

The reason iterrows() is slower than itertuples() is due to iterrows() doing a lot of type checks in the lifetime of its call.

How do I speed up Itertuples?

Itertuples convert the data frame to a list of tuples, then iterates through it, which makes it comparatively faster. Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow.


1 Answers

Let's create a dataframe with 1 million rows and time performance. The index is a Pandas Timestamp.

df = pd.DataFrame(np.random.randn(1000000, 3), 
                  columns=list('ABC'), 
                  index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))

Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):

%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)

%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)

%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)

%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)

Here are the timings with the Datetime index as a column:

df.reset_index(inplace=True)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)

All of the above indexing techniques produce the same dataframe:

>>> df2.shape
(250560, 3)

It appears that either of the first two methods are the best in this situation, and the fourth method also works just as fine using the latest version of Pandas.

like image 64
Alexander Avatar answered Sep 18 '22 17:09

Alexander