Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cumulative apply within window defined by other columns

Tags:

python

pandas

I am trying to apply a function, cumulatively, to values that lie within a window defined by 'start' and 'finish' columns. So, 'start' and 'finish' define the intervals where the value is 'active'; for each row, I want to get a sum of all 'active' values at the time.

Here is a 'bruteforce' example that does what I am after - is there a more elegant, faster or more memory efficient way of doing this?

df = pd.DataFrame(data=[[1,3,100], [2,4,200], [3,6,300], [4,6,400], [5,6,500]],
    columns=['start', 'finish', 'val'])
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()

Originally, df is:

  start  finish  val
0   1      3     100
1   2      4     200
2   3      6     300
3   4      6     400
4   5      6     500

The result I am after is:

1   100
2   300
3   500
4   700
5   1200
like image 979
rinspy Avatar asked Nov 12 '18 15:11

rinspy


1 Answers

numba

from numba import njit

@njit
def pir_numba(S, F, V):
  mn = S.min()
  mx = F.max()
  out = np.zeros(mx)
  for s, f, v in zip(S, F, V):
    out[s:f] += v
  return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])

np.bincount

s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))

array([ 100.,  300.,  500.,  700., 1200.])

Comprehension

This depends on the index being unique

pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)

1     100
2     300
3     500
4     700
5    1200
dtype: int64

With no dependence on index

pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
like image 59
piRSquared Avatar answered Oct 09 '22 04:10

piRSquared