Pandas count values greater than current row in the last n rows

How to get count of values greater than current row in the last n rows?

Imagine we have a dataframe as follows:

    col_a
0     8.4
1    11.3
2     7.2
3     6.5
4     4.5
5     8.9

I am trying to get a table such as the following, where n=3.

    col_a   col_b
0     8.4       0
1    11.3       0
2     7.2       2
3     6.5       3
4     4.5       3
5     8.9       0

Thanks in advance.

asked Jun 26 '18 09:06

koray1396

1 Answer

In pandas it is best to avoid explicit loops because they are slow; here it is better to use rolling with a custom function:

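import pandas as pd
df = pd.DataFrame({'col_a': [8.4, 11.3, 7.2, 6.5, 4.5, 8.9]})

n = 3
# window = current row plus up to n previous rows; raw=True passes a NumPy array,
# so x[-1] is the current value and x[:-1] are the previous ones
df['new'] = (df['col_a'].rolling(n + 1, min_periods=1)
                        .apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True)
                        .astype(int))
print(df)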
   col_a  new
0    8.4    0
1   11.3    0
2    7.2    2
3    6.5    3
4    4.5    3
5    8.9    0
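
To make clear what the rolling lambda computes, here is a minimal plain-Python sketch of the same count. It is only a reference for checking results on small inputs, not part of the original answer, and the function name `count_greater_in_last_n` is a hypothetical one chosen for illustration.

```python
def count_greater_in_last_n(values, n):
    """For each position i, count how many of the up to n previous values exceed values[i]."""
    counts = []
    for i, current in enumerate(values):
        previous = values[max(0, i - n):i]  # fewer than n values near the start
        counts.append(sum(1 for v in previous if v > current))
    return counts


col_a = [8.4, 11.3, 7.2, 6.5, 4.5, 8.9]
result = count_greater_in_last_n(col_a, 3)
print(result)  # [0, 0, 2, 3, 3, 0]
```

This runs in O(len(values) * n) pure-Python steps, so it is only meant for verifying the faster versions.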

If performance is important, use strides:

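import numpy as np

n = 3
# prepend n NaNs so the first rows also get a full window; NaN > value is always False
x = np.concatenate([[np.nan] * n, df['col_a'].values])

def rolling_window(a, window):
    # view of a as overlapping windows of length `window`, without copying
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

arr = rolling_window(x, n + 1)

# compare the n previous values (all but the last column) with the current value (last column)
df['new'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
print(df)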
   col_a  new
0    8.4    0
1   11.3    0
2    7.2    2
3    6.5    3
4    4.5    3
5    8.9    0
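
On NumPy 1.20 or newer, the same windowed view can probably be built with `sliding_window_view`, which returns a read-only view and avoids computing shape and strides by hand. The sketch below assumes that NumPy version; the wrapper name `count_greater_prev` is hypothetical and not from the answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # available in NumPy >= 1.20


def count_greater_prev(values, n):
    """Count, per element, how many of the up to n previous elements are greater."""
    values = np.asarray(values, dtype=float)
    # Pad the front with n NaNs so the first rows also get a full window;
    # comparisons against NaN are False, so the padding never adds to the count.
    padded = np.concatenate([np.full(n, np.nan), values])
    windows = sliding_window_view(padded, n + 1)  # shape (len(values), n + 1)
    # Last column is the current value, the other columns are the previous n values.
    return (windows[:, :-1] > windows[:, [-1]]).sum(axis=1)


counts = count_greater_prev([8.4, 11.3, 7.2, 6.5, 4.5, 8.9], n=3)
print(counts.tolist())  # [0, 0, 2, 3, 3, 0]
```

The result can be assigned to a column in the same way as above, for example `df['new'] = count_greater_prev(df['col_a'].to_numpy(), n)`.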

Performance, measured with perfplot for a small window, n = 3 (plot: https://i.stack.imgur.com/BpkDU.png):

import numpy as np
import pandas as pd
import perfplot

np.random.seed(1256)
n = 3

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def roll(df):
    df['new'] = (df['col_a'].rolling(n+1, min_periods=1).apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True).astype(int))
    return df

def list_comp(df):
    # use the window size n rather than a hard-coded 3 so the kernel stays valid when n changes
    df['count'] = [(j < df['col_a'].iloc[max(0, i-n):i]).sum() for i, j in df['col_a'].items()]
    return df

def strides(df):
    x = np.concatenate([[np.nan] * (n), df['col_a'].values])
    arr = rolling_window(x, n + 1)
    df['new1'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
    return df


def make_df(size):
    # size is the number of rows; named differently so it does not shadow the window size n
    df = pd.DataFrame(np.random.randint(20, size=size), columns=['col_a'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_comp, roll, strides],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    xlabel='len(df)')
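
Before reading timings, it can be worth confirming that different implementations return the same counts. The sketch below is not part of the original answer: it compares the rolling approach with a shift-based variant, a swapped-in technique used here only as an independent cross-check. The function names `roll_counts` and `shift_counts` are hypothetical.

```python
import numpy as np
import pandas as pd


def roll_counts(s, n):
    # Same rolling/apply idea as the answer, returned as a plain integer array.
    return (s.rolling(n + 1, min_periods=1)
             .apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True)
             .astype(int)
             .to_numpy())


def shift_counts(s, n):
    # Alternative for cross-checking: compare the series with each of its n lagged copies.
    total = np.zeros(len(s), dtype=int)
    for k in range(1, n + 1):
        # shift(k) introduces NaN at the start; NaN > value is False, so nothing is counted there
        total += (s.shift(k) > s).to_numpy()
    return total


rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 20, size=1000))
same = np.array_equal(roll_counts(s, 3), shift_counts(s, 3))
print(same)
```

If the two arrays differ, the usual suspects are the comparison direction or the handling of ties (both versions here use a strict "greater than").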

I was also curious about performance with a large window, n = 100 (plot: https://i.stack.imgur.com/qjp1j.png).

answered Oct 02 '22 23:10

jezrael

