
How to vectorize a loop through pandas series when values are used in slice of another series

Suppose I have two series of timestamps which are pairs of start/end times for various 5 hour ranges. They are not necessarily sequential, nor are they quantized to the hour.

import numpy as np
import pandas as pd

start = pd.Series(pd.date_range('20190412',freq='H',periods=25))

# Drop a few indexes to make the series not sequential
start = start.drop([4,5,10,14]).reset_index(drop=True)

# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')

end = start + pd.Timedelta('5H')

Now suppose that we have some data that is timestamped by minute, over a range that encompasses all start/end pairs.

data_series = pd.Series(data=np.random.randint(20, size=(75*60)), 
                        index=pd.date_range('20190411',freq='T',periods=(75*60)))

We wish to obtain the values from the data_series within the range of each start and end time. This can be done naively inside a loop

frm = []
for s,e in zip(start,end):
    frm.append(data_series.loc[s:e].values)

This naive approach loops over each pair of start and end dates and slices the values out of data_series one range at a time.

However this implementation is slow if len(start) is large. Is there a way to perform this sort of logic leveraging pandas vector functions?

It feels almost as though I want to apply .loc with a vector or pd.Series of timestamps rather than a single pd.Timestamp.

EDIT

Using .apply is at best marginally more efficient than the naive for loop. I was hoping to be pointed in the direction of a pure vectorized solution.
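For reference, the .apply attempt mentioned above might look something like this (a sketch, not code from the question; the bounds DataFrame is an assumption). It still performs one label lookup per row, which is why it is no faster than the explicit loop:

```python
import numpy as np
import pandas as pd

start = pd.Series(pd.date_range('2019-04-12', freq='H', periods=25))
end = start + pd.Timedelta('5H')
data_series = pd.Series(data=np.random.randint(20, size=75 * 60),
                        index=pd.date_range('2019-04-11', freq='T', periods=75 * 60))

# Pair the bounds up and apply the same .loc slice row-wise --
# one label-based lookup per row, just like the naive loop
bounds = pd.DataFrame({'start': start, 'end': end})
frm = bounds.apply(lambda r: data_series.loc[r['start']:r['end']].values,
                   axis=1).tolist()
```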

asked Apr 12 '19 by mch56




1 Answer

Basic Idea

At each data_series.loc[s:e], where s and e are datetime labels, pandas spends time searching the index for those specific positions. That lookup is costly inside a loop, and it's exactly where we can improve: find all of those positions in one vectorized pass with searchsorted, extract the values of data_series as a plain array, and then slice with the integer positions obtained from searchsorted. The remaining loop does minimal work, simply slicing an array.
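To make the idea concrete, np.searchsorted turns every timestamp lookup into a binary search over the sorted index values, returning integer positions for all timestamps at once (a toy illustration, not the answer's code):

```python
import numpy as np
import pandas as pd

# A small minute-frequency index and two timestamps to locate in it
idx = pd.date_range('2019-04-11', freq='T', periods=10).values
ts = pd.to_datetime(['2019-04-11 00:03', '2019-04-11 00:07']).values

# One vectorized binary search for all timestamps at once
pos = np.searchsorted(idx, ts)
print(pos)  # [3 7]
```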

The general mantra: do most of the work with vectorized pre-processing and keep the loop body minimal.

The implementation would look something like this -

def select_slices_by_index(data_series, start, end):
    idx = data_series.index.values
    # Vectorized binary searches for all start/end positions at once
    S = np.searchsorted(idx, start.values)
    E = np.searchsorted(idx, end.values)
    # Slice the underlying array with integer indices; the +1 keeps the
    # end label inclusive, matching .loc's behaviour
    ar = data_series.values
    return [ar[i:j] for (i,j) in zip(S,E+1)]
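As a quick sanity check on a toy series (not the question's data; the function is repeated here so the snippet runs standalone), the helper returns exactly the same slices as the naive .loc loop:

```python
import numpy as np
import pandas as pd

# Helper from above, repeated so this snippet is self-contained
def select_slices_by_index(data_series, start, end):
    idx = data_series.index.values
    S = np.searchsorted(idx, start.values)
    E = np.searchsorted(idx, end.values)
    ar = data_series.values
    return [ar[i:j] for (i, j) in zip(S, E + 1)]

# Toy minute-indexed series and two start/end pairs on the grid
data_series = pd.Series(np.arange(60),
                        index=pd.date_range('2019-04-11', freq='T', periods=60))
start = pd.Series(pd.to_datetime(['2019-04-11 00:05', '2019-04-11 00:30']))
end = start + pd.Timedelta('10T')

fast = select_slices_by_index(data_series, start, end)
naive = [data_series.loc[s:e].values for s, e in zip(start, end)]
assert all(np.array_equal(a, b) for a, b in zip(fast, naive))
```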

Use NumPy-striding

For the specific case when the time period between start and end is the same for all entries and every such window fits within the data, i.e. no out-of-bounds cases, we can use NumPy's sliding window trick.

We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get all the sliding windows as a single 2D view.

from skimage.util.shape import view_as_windows

def select_slices_by_index_strided(data_series, start, end):
    idx = data_series.index.values
    # Common window length shared by all pairs (end inclusive, hence +1)
    L = np.searchsorted(idx,end.values[0])-np.searchsorted(idx,start.values[0])+1
    S = np.searchsorted(idx,start.values)
    ar = data_series.values
    # All length-L sliding windows as a 2D view; pick the rows that
    # begin at each start position
    w = view_as_windows(ar,L)
    return w[S]

Use this post if you don't have access to scikit-image.
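If scikit-image isn't available, newer NumPy (1.20+) ships an equivalent helper, np.lib.stride_tricks.sliding_window_view, which can stand in for view_as_windows in the 1-D case (a sketch under that assumption, not part of the original answer):

```python
import numpy as np
import pandas as pd

def select_slices_strided_np(data_series, start, end):
    idx = data_series.index.values
    # Common window length, assuming every (start, end) span covers it
    L = (np.searchsorted(idx, end.values[0])
         - np.searchsorted(idx, start.values[0]) + 1)
    S = np.searchsorted(idx, start.values)
    # All length-L sliding windows as a view, no data copied
    w = np.lib.stride_tricks.sliding_window_view(data_series.values, L)
    return w[S]

# Toy minute-indexed data with two 30-minute spans
data_series = pd.Series(np.arange(120),
                        index=pd.date_range('2019-04-11', freq='T', periods=120))
start = pd.Series(pd.to_datetime(['2019-04-11 00:10', '2019-04-11 00:40']))
end = start + pd.Timedelta('30T')

out = select_slices_strided_np(data_series, start, end)
# Each row is one 31-minute slice (both endpoints inclusive)
assert out.shape == (2, 31)
```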


Benchmarking

Let's scale everything up by 100x on the given sample data and test it out.

Setup -

np.random.seed(0)
start = pd.Series(pd.date_range('20190412',freq='H',periods=2500))

# Drop a few indexes to make the series not sequential
start = start.drop([4,5,10,14]).reset_index(drop=True)

# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')

end = start + pd.Timedelta('5H')
data_series = pd.Series(data=np.random.randint(20, size=(750*600)), 
                        index=pd.date_range('20190411',freq='T',periods=(750*600)))

Timings -

In [156]: %%timeit
     ...: frm = []
     ...: for s,e in zip(start,end):
     ...:     frm.append(data_series.loc[s:e].values)
1 loop, best of 3: 172 ms per loop

In [157]: %timeit select_slices_by_index(data_series, start, end)
1000 loops, best of 3: 1.23 ms per loop

In [158]: %timeit select_slices_by_index_strided(data_series, start, end)
1000 loops, best of 3: 994 µs per loop

In [161]: frm = []
     ...: for s,e in zip(start,end):
     ...:     frm.append(data_series.loc[s:e].values)

In [162]: np.allclose(select_slices_by_index(data_series, start, end),frm)
Out[162]: True

In [163]: np.allclose(select_slices_by_index_strided(data_series, start, end),frm)
Out[163]: True

That's roughly 140x and 170x speedups over the naive loop!

answered Sep 28 '22 by Divakar