Suppose I have two series of timestamps which are pairs of start/end times for various 5-hour ranges. They are not necessarily sequential, nor are they quantized to the hour.
import numpy as np
import pandas as pd
start = pd.Series(pd.date_range('20190412',freq='H',periods=25))
# Drop a few indexes to make the series not sequential
start = start.drop([4,5,10,14]).reset_index(drop=True)
# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')
end = start + pd.Timedelta('5H')
Now suppose that we have some data that is timestamped by minute, over a range that encompasses all start/end pairs.
data_series = pd.Series(data=np.random.randint(20, size=(75*60)),
                        index=pd.date_range('20190411',freq='T',periods=(75*60)))
We wish to obtain the values from data_series within the range of each start and end time. This can be done naively inside a loop:
frm = []
for s,e in zip(start,end):
    frm.append(data_series.loc[s:e].values)
As we can see, this naive approach loops over each pair of start and end dates and pulls the values out of data_series. However, this implementation is slow if len(start) is large. Is there a way to perform this sort of logic leveraging pandas vector functions? I feel it is almost as if I want to apply .loc with a vector or pd.Series rather than a single pd.Timestamp.
EDIT
Using .apply is no more (or only marginally more) efficient than using the naive for loop. I was hoping to be pointed in the direction of a pure vector solution.
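For reference, such an .apply version might look roughly like the sketch below (an assumed reconstruction, not the exact code from the post); it still performs one .loc lookup per row, which is why it is no faster than the explicit loop:
bounds = pd.DataFrame({'start': start, 'end': end})
# Still one .loc slice per row, i.e. effectively a Python-level loop
frm = list(bounds.apply(lambda row: data_series.loc[row['start']:row['end']].values, axis=1))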
As usual, pandas would spend time searching for that one specific index at data_series.loc[s:e], where s and e are datetime indices. That's costly when looping, and that's exactly where we would improve. We would find all of those indices in a vectorized manner with searchsorted. Then, we would extract the values off data_series as an array and use the indices obtained from searchsorted with simple integer-based indexing. Thus, the loop would be left with the minimal work of simple slicing off an array.
General mantra being - do most of the work with pre-processing in a vectorized manner and the minimum when looping.
The implementation would look something like this -
def select_slices_by_index(data_series, start, end):
    idx = data_series.index.values
    # Vectorized lookup of the integer positions of all starts and ends
    S = np.searchsorted(idx,start.values)
    E = np.searchsorted(idx,end.values)
    # Plain integer slicing off the underlying array (.loc's end is inclusive, hence E+1)
    ar = data_series.values
    return [ar[i:j] for (i,j) in zip(S,E+1)]
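With the start, end and data_series defined above, usage is direct; each element of the returned list is the array of values for one start/end pair:
slices = select_slices_by_index(data_series, start, end)
# slices[0] holds the same values as data_series.loc[start[0]:end[0]].values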
NumPy-striding
For the specific case when the time period between starts and ends is the same for all entries and all slices are covered by that length, i.e. no out-of-bounds cases, we can use NumPy's sliding window trick.
We can leverage scikit-image's view_as_windows, which is based on np.lib.stride_tricks.as_strided, to get sliding windows. More info on the use of as_strided-based view_as_windows.
from skimage.util.shape import view_as_windows

def select_slices_by_index_strided(data_series, start, end):
    idx = data_series.index.values
    # Window length in samples, assumed identical for every start/end pair
    L = np.searchsorted(idx,end.values[0])-np.searchsorted(idx,start.values[0])+1
    S = np.searchsorted(idx,start.values)
    ar = data_series.values
    # All sliding windows of length L as a zero-copy strided view; pick out the needed rows
    w = view_as_windows(ar,L)
    return w[S]
Use this post if you don't have access to scikit-image.
Let's scale up everything by 100x on the given sample data and test it out.
Setup -
np.random.seed(0)
start = pd.Series(pd.date_range('20190412',freq='H',periods=2500))
# Drop a few indexes to make the series not sequential
start = start.drop([4,5,10,14]).reset_index(drop=True)
# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')
end = start + pd.Timedelta('5H')
data_series = pd.Series(data=np.random.randint(20, size=(750*600)),
                        index=pd.date_range('20190411',freq='T',periods=(750*600)))
Timings -
In [156]: %%timeit
...: frm = []
...: for s,e in zip(start,end):
...:     frm.append(data_series.loc[s:e].values)
1 loop, best of 3: 172 ms per loop
In [157]: %timeit select_slices_by_index(data_series, start, end)
1000 loops, best of 3: 1.23 ms per loop
In [158]: %timeit select_slices_by_index_strided(data_series, start, end)
1000 loops, best of 3: 994 µs per loop
In [161]: frm = []
...: for s,e in zip(start,end):
...:     frm.append(data_series.loc[s:e].values)
In [162]: np.allclose(select_slices_by_index(data_series, start, end),frm)
Out[162]: True
In [163]: np.allclose(select_slices_by_index_strided(data_series, start, end),frm)
Out[163]: True
140x+ and 170x speedups with these ones!