 

How can I select all DataFrame rows that are within a certain distance of a given value in a specific column?

Here is an example DataFrame which I will use to better illustrate my question:

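import numpy as np
import pandas as pd

# pd.np was removed in pandas 2.0, so NumPy is imported directly
df = pd.DataFrame(np.random.rand(30, 3), columns=tuple('ABC'))
# object dtype lets the column hold both NaN and strings
df['event'] = pd.Series(np.nan, index=df.index, dtype=object)
df.loc[10, 'event'] = 'ping'
df.loc[20, 'event'] = 'ping'
df.loc[19, 'event'] = 'pong'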

I need to create windows of n rows centered around each occurrence of ping.

In other words, let i be the index of a row that contains ping in the event column. For each i, I want to select df.loc[i-n:i+n] (a label-based slice, so both endpoints are included).

Thus, for n=3, I would expect the following result:

             A          B          C event
7    0.8295863  0.2162861  0.4856461   NaN
8     0.156646  0.4730667  0.9968878   NaN
9    0.6709413  0.4796197  0.8747416   NaN
10  0.09942329   0.154008  0.5761598  ping
11   0.7168143   0.678207  0.7281105   NaN
12   0.8915475  0.8013187  0.9049722   NaN
13   0.9545411  0.4844835  0.1645746   NaN
17   0.9909208  0.1091025  0.6582635   NaN
18   0.2536326  0.4324749  0.8001643   NaN
19   0.4734659  0.5582809  0.1221296  pong
20   0.7230407  0.6695843  0.3902591  ping
21   0.3624909  0.2685049  0.5484445   NaN
22  0.05626284  0.6113877  0.9131929   NaN
23   0.8312294  0.5694373  0.4325798   NaN

[14 rows x 4 columns]

A few caveats:

  1. I'm looking for a non-iterative solution.
  2. Note that there is a pong value around which we do not want to center a window. It is captured in the result of centering around the second ping, however.

How can this be achieved?

Louis Thibault asked Feb 12 '23 13:02



1 Answer

In [17]: n = 3

Build an indexer that covers the range you need around each target, e.g. the target index ± 3 (clipped to the min/max of the frame). Concatenate the ranges and eliminate duplicates.

In [18]: indexers = np.unique(np.concatenate([ np.arange(max(i-n,0),min(i+n,len(df))) for i in df[df.event=='ping'].index ]))

In [19]: indexers
Out[19]: array([ 7,  8,  9, 10, 11, 12, 17, 18, 19, 20, 21, 22])

Select them.

In [20]: df.iloc[indexers]
Out[20]: 
             A           B          C event
7   0.03348742  0.05735324  0.1220022   NaN
8    0.9567363   0.6539097  0.8409577   NaN
9    0.3115902   0.4955503  0.1749197   NaN
10   0.6883777   0.6185107  0.7933182  ping
11   0.5185129   0.6533616  0.1569159   NaN
12   0.1196976   0.9638604  0.7318006   NaN
17  0.02897615   0.1224485  0.5706852   NaN
18  0.02409971   0.4715463  0.4587161   NaN
19   0.9070592   0.3371241  0.9543977  pong
20   0.8533369   0.7549413  0.5334882  ping
21   0.9546738   0.8203931  0.8543028   NaN
22  0.05691086   0.2402766  0.3922318   NaN

Note that you might need to call df.reset_index() before you select, so that the index holds the actual row positions rather than arbitrary label values (df.iloc selects by position).
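
The comprehension in In [18] can also be written without a Python-level loop by broadcasting. The following is a minimal sketch, not part of the original answer; the helper name window_positions is hypothetical. It assumes each window should include both endpoints (i-n through i+n), as in the question's expected output; because np.arange excludes its stop value, the session above ends each window at i+n-1. It works with row positions rather than index labels, so it does not depend on the index being 0..len(df)-1.

```python
import numpy as np
import pandas as pd


def window_positions(mask, n):
    """Return sorted, unique row positions within n rows of each True in mask.

    Hypothetical helper for illustration; `mask` is a boolean array-like.
    """
    mask = np.asarray(mask, dtype=bool)
    centers = np.flatnonzero(mask)            # positions of the matching rows
    offsets = np.arange(-n, n + 1)            # -n .. +n, both ends included
    # Broadcasting builds one row of candidate positions per center.
    candidates = centers[:, None] + offsets
    # Discard positions that fall outside the frame, then de-duplicate.
    valid = candidates[(candidates >= 0) & (candidates < len(mask))]
    return np.unique(valid)


# Same layout as the question's frame: pings at rows 10 and 20, pong at 19.
df = pd.DataFrame(np.random.rand(30, 3), columns=list("ABC"))
df["event"] = pd.Series(np.nan, index=df.index, dtype=object)
df.loc[[10, 20], "event"] = "ping"
df.loc[19, "event"] = "pong"

positions = window_positions(df["event"] == "ping", n=3)
result = df.iloc[positions]
```

For the question's frame, positions covers rows 7 through 13 and 17 through 23, i.e. the 14 rows the question asks for. The pong row at 19 is included only because it falls inside the window around the ping at 20.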

Note that there is a bug here: setting the 'event' column converts everything to object dtype, see here. You can alleviate this with df.convert_objects() (deprecated in pandas 0.21 and later removed; df.infer_objects() is the current equivalent).
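
On current pandas, where convert_objects() is no longer available, DataFrame.infer_objects() performs a similar soft conversion. A minimal sketch, assuming a frame whose numeric column has ended up as object dtype (the frame below is made up for illustration and is not the one from the session above):

```python
import numpy as np
import pandas as pd

# Illustrative frame: column A holds floats but is stored as object dtype.
df = pd.DataFrame({
    "A": np.array([0.1, 0.2, 0.3], dtype=object),
    "event": np.array([np.nan, "ping", np.nan], dtype=object),
})
before = df.dtypes.astype(str).to_dict()

# infer_objects() returns a new frame; an object column that holds only
# numeric values is given a numeric dtype.
fixed = df.infer_objects()
after = fixed.dtypes.astype(str).to_dict()
```

Comparing before["A"] with after["A"] shows whether the conversion took effect; the original frame is left unchanged.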

Jeff answered Apr 13 '23 00:04
