
Testing subsequent values in a DataFrame

Tags: python, pandas

I have a DataFrame with one column with positive and negative integers. For each row, I'd like to see how many consecutive rows (starting with and including the current row) have negative values.

So if a sequence was 2, -1, -3, 1, -1, the result would be 0, 2, 1, 0, 1.

I can do this by iterating over all the indices, using .iloc to slice the column, and next() to find where the next positive value is. But I feel like this isn't taking advantage of pandas' capabilities, and I imagine that there's a better way of doing it. I've experimented with using .shift() and expanding_window but without success.

Is there a more "pandastic" way of finding out how many consecutive rows after the current one meet some logical condition?

Here's what's working now:

import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1]})

df["b"] = 0
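for i in df.index:
    sub = df.iloc[i:].a.tolist()
    # position of the first non-negative value at or after row i;
    # df.loc avoids the chained assignment of df.b.iloc[i] = ...
    df.loc[i, "b"] = next((sub.index(n) for n in sub if n >= 0), 1)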

Edit: I realize that even my own example doesn't work when there's more than one negative value at the end. So that makes a better solution even more necessary.

Edit 2: I stated the problem in terms of integers, but originally only put 1 and -1 in my example. I need to solve for positive and negative integers in general.
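
For reference, the intended result can be pinned down with a plain right-to-left scan. This is only a sketch of the expected behavior, not the vectorized solution being asked for, and `count_following_negatives` is a hypothetical name introduced here for illustration:

```python
def count_following_negatives(values):
    """For each position, count the run of negatives that starts there."""
    counts = [0] * len(values)
    run = 0
    # Walk right to left so each count can reuse the count of the row after it.
    for i in range(len(values) - 1, -1, -1):
        run = run + 1 if values[i] < 0 else 0
        counts[i] = run
    return counts


print(count_following_negatives([2, -1, -3, 1, -1]))
# [0, 2, 1, 0, 1]
```

Because the scan starts at the end, a run of several negatives at the tail is counted like any other run.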

ASGM asked Apr 07 '15 18:04


2 Answers

FWIW, here's a fairly pandastic answer that requires no functions or applies. It borrows from other answers, and thanks go to @DSM for mentioning the ascending=False option:

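import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1, -2]})

# flag positives, then start a new group id whenever the flag changes
# (assumes no zeros: a zero would be grouped with the negatives)
df['pos'] = df.a > 0
df['grp'] = (df['pos'] != df['pos'].shift()).cumsum()
dfg = df.groupby('grp')
# count down within each group; keep the count only for negative rows
df['c'] = np.where(df['a'] < 0, dfg.cumcount(ascending=False) + 1, 0)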

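Output (column b appears to be carried over from the question's loop run on this data; it is wrong at row 8, where the trailing run should count 2):

   a  b    pos  grp  c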
0  2  0   True    1  0
1 -1  3  False    2  3
2 -3  2  False    2  2
3 -1  1  False    2  1
4  1  0   True    3  0
5  1  0   True    3  0
6 -1  1  False    4  1
7  1  0   True    5  0
8 -1  1  False    6  2
9 -2  1  False    6  1

I think a nice thing about this method is that once you set up the 'grp' variable you can do lots of things very easily with standard groupby methods.
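
As a rough illustration of that point, the same 'grp' column can feed other standard groupby calls. The sketch below assumes the data from this answer; the names `run_len`, `runs`, and `longest_negative` are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1, -2]})
df["pos"] = df["a"] > 0
df["grp"] = (df["pos"] != df["pos"].shift()).cumsum()

# Length of the run each row belongs to, broadcast back onto every row.
df["run_len"] = df.groupby("grp")["a"].transform("size")

# One row per run: its length, its sum, and its largest value.
runs = df.groupby("grp")["a"].agg(["size", "sum", "max"])

# A run is negative when even its largest value is below zero.
longest_negative = runs.loc[runs["max"] < 0, "size"].max()

print(df["run_len"].tolist())
print(longest_negative)
```

Using `max < 0` to pick out the negative runs is just one convenient choice; filtering on the 'pos' flag would work as well.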

JohnE answered Oct 06 '22 21:10


This was an interesting puzzle. I found a way to do it using pandas tools, but I think you'll agree it's a lot more opaque :-). Here's the example:

import numpy as np
import pandas

data = pandas.Series([1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1])
x = data[::-1] # reverse the data

# group_keys=False keeps the original index on the result in current pandas
print(x.groupby(((x<0) != (x<0).shift()).cumsum(), group_keys=False).apply(
    lambda x: pandas.Series(
        np.arange(len(x))+1 if (x<0).all() else np.zeros(len(x)),
        index=x.index))[::-1])

The output is correct:

0     0
1     3
2     2
3     1
4     0
5     2
6     1
7     0
8     0
9     1
10    0
dtype: float64

The basic idea is similar to what I described in my answer to this question, and you can find the same approach used in various answers to questions about making use of inter-row information in pandas. Your question is slightly trickier because your criterion goes in reverse (asking for the number of following negatives rather than the number of preceding negatives), and because you only want one side of the grouping (i.e., you only want the number of consecutive negatives, not the number of consecutive numbers with the same sign).
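
To make those two quirks concrete, here is a small sketch that handles both the direction and the one-sided grouping for an arbitrary boolean condition. It uses `cumcount` rather than the `apply` shown above, and the helper name `run_counts` and its `following` parameter are assumptions made for this illustration:

```python
import pandas as pd


def run_counts(mask, following=True):
    """Count, per row, the consecutive True values of a boolean Series.

    following=True counts from the current row to the end of its run;
    following=False counts from the start of its run up to the current row.
    """
    mask = mask.astype(bool)
    # A new block starts wherever the flag differs from the previous row.
    block = (mask != mask.shift()).cumsum()
    # cumcount numbers rows within a block; ascending=False counts down.
    pos = mask.groupby(block).cumcount(ascending=not following) + 1
    # Keep only one side of the grouping: rows where the condition holds.
    return pos.where(mask, 0)


data = pd.Series([1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1])
print(run_counts(data < 0).tolist())
# [0, 3, 2, 1, 0, 2, 1, 0, 0, 1, 0]
print(run_counts(data < 0, following=False).tolist())
# [0, 1, 2, 3, 0, 1, 2, 0, 0, 1, 0]
```

Passing a different mask, such as `data > 0`, counts runs for that condition instead.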

Here is a more verbose version of the same code with some explanation that may make it easier to grasp:

def getNegativeCounts(x):
    # This function takes as input a sequence of numbers, all the same sign.
    # If they're negative, it returns an increasing count of how many there are.
    # If they're positive, it just returns the same number of zeros.
    # [-1, -2, -3] -> [1, 2, 3]
    # [1, 2, 3] -> [0, 0, 0]
    if (x<0).all():
        return pandas.Series(np.arange(len(x))+1, index=x.index)
    else:
        return pandas.Series(np.zeros(len(x)), index=x.index)

# we have to reverse the data because cumsum only works in the forward direction
x = data[::-1]

# flag each number whose sign differs from the previous one (the start of a new block)
signChanged = (x<0) != (x<0).shift()
# cumsum this to get an "ID" for each block of consecutive same-sign numbers
sameSignBlocks = signChanged.cumsum()
# group on these block IDs (group_keys=False keeps the original index on the result)
g = x.groupby(sameSignBlocks, group_keys=False)
# for each block, apply getNegativeCounts
# this will either give us the running count of negatives in the block,
# or a stretch of zeros if the block was positive
# the [::-1] at the end reverses the result
# (to compensate for our reversing the data initially)
g.apply(getNegativeCounts)[::-1]

As you can see, run-length-style operations are not usually simple in pandas. There is, however, an open issue for adding more grouping/partitioning abilities that would ameliorate some of this. In any case, your particular use case has some specific quirks that make it a bit different from a typical run-length task.
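
Since run-length logic is awkward to express with groupby, one alternative is to drop down to NumPy for this particular count. The sketch below is a different technique from the code above (a reversed running minimum, no grouping), and `following_run_lengths` is a hypothetical name:

```python
import numpy as np


def following_run_lengths(values):
    """Count the consecutive negatives starting at each position."""
    neg = np.asarray(values) < 0
    n = len(neg)
    idx = np.arange(n)
    # Each non-negative entry marks its own position; negatives get the sentinel n.
    stop = np.where(neg, n, idx)
    # A reversed running minimum finds the next non-negative position at or after i.
    next_stop = np.minimum.accumulate(stop[::-1])[::-1]
    # The distance to that position is the length of the run starting at i.
    return next_stop - idx


print(following_run_lengths([2, -1, -3, -1, 1, 1, -1, 1, -1, -2]).tolist())
# [0, 3, 2, 1, 0, 0, 1, 0, 2, 1]
```

The sentinel `n` is what makes a run at the very end of the data count correctly.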

BrenBarn answered Oct 06 '22 22:10