I have a pandas dataframe as follows:
time winner loser stat
1 A B 0
2 C B 0
3 D B 1
4 E B 0
5 F A 0
6 G A 0
7 H A 0
8 I A 1
each row is a match result. the first column is the time of the match, second and third column contain winner/loser and the fourth column is one stat from the match.
I want to detect streaks of zeros for this stat per loser.
The expected result should look like this:
time winner loser stat streak
1 A B 0 1
2 C B 0 2
3 D B 1 0
4 E B 0 1
5 F A 0 1
6 G A 0 2
7 H A 0 3
8 I A 1 0
In pseudocode the algorithm should work like this:
.groupby
loser
column. loser
groupstat
column: if it contains 0
, then increment the streak
value from the previous row by 0
. if it is not 0
, then start a new streak
, that is, put 0
into the streak
column.So the .groupby
is clear. But then I would need some sort of .apply
where I can look at the previous row? this is where I am stuck.
Set value of display. max_rows to None and pass it to set_option and this will display all rows from the data frame.
To get the number of rows, and columns we can use len(df. axes[]) function in Python.
To get column average or mean from pandas DataFrame use either mean() and describe() method. The DataFrame. mean() method is used to return the mean of the values for the requested axis.
You can apply
custom function f
, then cumsum
, cumcount
and astype
:
def f(x):
x['streak'] = x.groupby( (x['stat'] != 0).cumsum()).cumcount() +
( (x['stat'] != 0).cumsum() == 0).astype(int)
return x
df = df.groupby('loser', sort=False).apply(f)
print df
time winner loser stat streak
0 1 A B 0 1
1 2 C B 0 2
2 3 D B 1 0
3 4 E B 0 1
4 5 F A 0 1
5 6 G A 0 2
6 7 H A 0 3
7 8 I A 1 0
For better undestanding:
def f(x):
x['c'] = (x['stat'] != 0).cumsum()
x['a'] = (x['c'] == 0).astype(int)
x['b'] = x.groupby( 'c' ).cumcount()
x['streak'] = x.groupby( 'c' ).cumcount() + x['a']
return x
df = df.groupby('loser', sort=False).apply(f)
print df
time winner loser stat c a b streak
0 1 A B 0 0 1 0 1
1 2 C B 0 0 1 1 2
2 3 D B 1 1 0 0 0
3 4 E B 0 1 0 1 1
4 5 F A 0 0 1 0 1
5 6 G A 0 0 1 1 2
6 7 H A 0 0 1 2 3
7 8 I A 1 1 0 0 0
Not as elegant as jezrael's answer, but for me easier to understand...
First, define a function that works with a single loser:
def f(df):
df['streak2'] = (df['stat'] == 0).cumsum()
df['cumsum'] = np.nan
df.loc[df['stat'] == 1, 'cumsum'] = df['streak2']
df['cumsum'] = df['cumsum'].fillna(method='ffill')
df['cumsum'] = df['cumsum'].fillna(0)
df['streak'] = df['streak2'] - df['cumsum']
df.drop(['streak2', 'cumsum'], axis=1, inplace=True)
return df
The streak is essentially a cumsum
, but we need to reset it each time stat
is 1. We therefore subtract the value of the cumsum
where stat
is 1, carried forward until the next 1.
Then groupby
and apply
by loser:
df.groupby('loser').apply(f)
The result is as expected.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With