Efficiently finding consecutive streaks in a pandas DataFrame column?

I have a DataFrame similar to the one below, and I want to add a Streak column to it (shown in the example):

Date         Home_Team    Away_Team    Winner      Streak

2005-08-06       A            G           A           0
2005-08-06       B            H           H           0
2005-08-06       C            I           C           0
2005-08-06       D            J           J           0
2005-08-06       E            K           K           0
2005-08-06       F            L           F           0
2005-08-13       A            B           A           1           
2005-08-13       C            D           D           1           
2005-08-13       E            F           F           0        
2005-08-13       G            H           H           0
2005-08-13       I            J           J           0
2005-08-13       K            L           K           1
2005-08-20       B            C           B           0
2005-08-20       A            D           A           2
2005-08-20       G            K           K           0
2005-08-20       I            E           E           0
2005-08-20       F            H           F           2
2005-08-20       J            L           J           2
2005-08-27       A            H           A           3
2005-08-27       B            F           B           1
2005-08-27       J            C           C           3           
2005-08-27       D            E           D           0
2005-08-27       I            K           K           0
2005-08-27       L            G           G           0
2005-09-05       B            A           A           2
2005-09-05       D            C           D           1
2005-09-05       F            E           F           0
2005-09-05       H            G           H           0
2005-09-05       J            I           I           0
2005-09-05       K            L           K           4

The DataFrame is approximately 200k rows going from 2005 to 2020.

Now, what I am trying to do is find the number of consecutive games the Home Team has won PRIOR to the date in the Date column of each row. I have a solution, but it is too slow, see below:

df["Streak"] = 0
def home_streak(x): # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    
    # all previous matches for the home team 
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] <  date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0: # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team: # lost the last match
        return 0
    else: # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else: # they lost, return the streak
                return streak
        return streak # every previous match was a win, so the loop never returned

df["Streak"] = df.apply(home_streak, axis=1)

How can I speed this up?

the man asked Aug 31 '20 13:08


1 Answer

I will present a numpy-based solution here. Firstly because I am not very familiar with pandas and don't feel like doing the research, and secondly because a numpy solution should work just fine regardless.

Let's take a look at what happens to one given team first. Your goal is to find the number of consecutive wins for a team based on the sequence of games it participated in. I will drop the date column and turn your data into a numpy array for starters:

import numpy as np

x = np.array([
    ['A', 'G', 'A'],
    ['B', 'H', 'H'],
    ['C', 'I', 'C'],
    ['D', 'J', 'J'],
    ['E', 'K', 'K'],
    ['F', 'L', 'F'],
    ['A', 'B', 'A'],
    ['C', 'D', 'D'],
    ['E', 'F', 'F'],
    ['G', 'H', 'H'],
    ['I', 'J', 'J'],
    ['K', 'L', 'K'],
    ['B', 'C', 'B'],
    ['A', 'D', 'A'],
    ['G', 'K', 'K'],
    ['I', 'E', 'E'],
    ['F', 'H', 'F'],
    ['J', 'L', 'J']])

You don't need the date because all you care about is who played and in what order, even if a team played multiple times in one day. This does rely on the rows already being sorted chronologically. So let's take a look at just team A:

A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
A_won = x[A_played, -1] == 'A'

A_played is an index array holding the row numbers in x of the games A took part in. A_won is a boolean mask with len(A_played) elements; i.e., one entry per game A participated in.
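As a quick illustration of the difference between the index array and the mask, here is a minimal sketch on a made-up four-game schedule (the `games` array and the lowercase names are hypothetical, not part of the data above):

```python
import numpy as np

# Hypothetical mini-schedule: columns are home, away, winner.
games = np.array([
    ['A', 'B', 'A'],
    ['C', 'D', 'D'],
    ['B', 'A', 'B'],
    ['A', 'C', 'A'],
])

# Row numbers of the games in which A took part (home or away).
a_played = np.flatnonzero((games[:, :2] == 'A').any(axis=1))

# One boolean per game A played: did A win it?
a_won = games[a_played, -1] == 'A'

print(a_played.tolist())  # [0, 2, 3]
print(a_won.tolist())     # [True, False, True]
```

The index array lets you write results back to the right rows of the full table later, while the mask only describes A's own sequence of games.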

Finding the sizes of the streaks is a fairly well hashed out problem:

streaks = np.diff(np.flatnonzero(np.diff(np.r_[False, A_won, False])))[::2]

You compute the differences between each pair of indices where the value of the mask switches. The extra padding with False ensures that you know which way the mask is switching. What you are looking for is based on this computation but requires a bit more detail, since you want the cumulative sum, but reset after each run. You can do that by setting the value of the data to the negated run length immediately after the run:

wins = np.r_[0, A_won, 0]  # Notice the int dtype here
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]
wins[switch_indices[1::2]] = -streaks

Now you have a trimmable array whose cumulative sum can be assigned directly to the output columns:

streak_counts = np.cumsum(wins[:-2])
output = np.zeros((x.shape[0], 2), dtype=int)

# Home streak
home_mask = x[A_played, 0] == 'A'
output[A_played[home_mask], 0] = streak_counts[home_mask]

# Away streak
away_mask = ~home_mask
output[A_played[away_mask], 1] = streak_counts[away_mask]
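To see the reset trick in isolation, the same steps can be run on a short made-up win/loss sequence (the `won` array here is hypothetical, chosen only so the runs are easy to follow by eye):

```python
import numpy as np

# Hypothetical results for one team, in playing order.
won = np.array([True, True, False, True, True, True, False])

# Pad with zeros so every run has a clear start and end.
wins = np.r_[0, won, 0]

# Positions in `wins` where a run of wins starts (even slots)
# and the position just after it ends (odd slots).
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]

# Put the negated run length right after each run so the running sum resets.
wins[switch_indices[1::2]] = -streaks

# Drop the last two entries: entry k is then the streak *before* game k.
streak_counts = np.cumsum(wins[:-2])

print(streaks.tolist())        # [2, 3]
print(wins.tolist())           # [0, 1, 1, -2, 1, 1, 1, -3, 0]
print(streak_counts.tolist())  # [0, 1, 2, 0, 1, 2, 3]
```

Each entry of `streak_counts` only reflects earlier games, which is what the "prior to this game" requirement in the question asks for.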

Now you can loop over all teams (which should be a fairly small number compared to the total number of games):

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])

    home_mask = data[played, 0] == team
    away_mask = ~home_mask

    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[away_mask], 1] = streak_counts[away_mask]

output = np.zeros((x.shape[0], 2), dtype=int)

# np.unique(x[:, :2]) copies the data, but it also picks up teams that only
# ever played away (H and L in the sample above), which set(x[:, 0]) would miss.
for team in np.unique(x[:, :2]):
    process_team(x, team, output)
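For completeness, here is one possible way to wire this back into a DataFrame. It is only a sketch: the column names follow the question, the teams X, Y and Z and their results are made up, `Away_Streak` is a hypothetical extra column, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def process_team(data, team, output):
    # Same steps as the function above.
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])
    home_mask = data[played, 0] == team
    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[~home_mask], 1] = streak_counts[~home_mask]

# Made-up games in the question's layout.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-08", "2020-01-15",
                            "2020-01-22", "2020-01-29"]),
    "Home_Team": ["X", "Z", "X", "Y", "Z"],
    "Away_Team": ["Y", "X", "Z", "X", "Y"],
    "Winner":    ["X", "X", "Z", "Y", "Z"],
})

# The streak logic depends on row order, so sort chronologically first.
df = df.sort_values("Date", kind="stable").reset_index(drop=True)
data = df[["Home_Team", "Away_Team", "Winner"]].to_numpy(dtype=str)

output = np.zeros((len(df), 2), dtype=int)
for team in np.unique(data[:, :2]):
    process_team(data, team, output)

df["Streak"] = output[:, 0]       # home team's streak before each game
df["Away_Streak"] = output[:, 1]  # away team's streak before each game

print(df["Streak"].tolist())       # [0, 0, 2, 0, 1]
print(df["Away_Streak"].tolist())  # [0, 1, 0, 0, 1]
```

One caveat: this counts previous games rather than previous dates, so if a team appears twice on the same date, the earlier row is treated as prior to the later one.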

Mad Physicist answered Oct 12 '22 01:10