Pandas / Python - Very slow performance using stack() groupby() and apply()

I am trying to create a new column in a dataframe based on pairs of information and their previous values. Although the code I run is correct and gives the results I need, it is very slow on a large dataframe, so I suspect I am not using the full power of Python for this task. Is there a more efficient and faster way of doing this in Python?

To put you in context, let me explain to you a little about what I am looking for:

I have a dataframe that describes competition results: for each 'date' you can see each 'type' that competed and its score, called 'xx'.

What my code does is compute, for each 'date', the difference in score 'xx' between every pair of 'type' values, and then, for each type, the sum of the score differences from all of its previous competitions against the types it meets on that date ('win_comp_past_difs').

Below you can see the data and the model with its output.

## I. DATA AND MODEL ##

I.1. Data

import pandas as pd
import numpy as np

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y') 
df=df.set_index(['date','type'])
df['xx'] = df.xx.astype('float')

Which looks like this:

                  xx
date       type
2018-01-01 A     1.0
           B     5.0
2018-02-01 B     3.0
2018-03-01 A     2.0
           B     7.0
           C     3.0
           D     1.0
           E     6.0
2018-05-01 B     3.0
2018-06-01 A     5.0
           B     2.0
           C     3.0
2018-07-01 A     1.0
2018-08-01 B     9.0
           C     3.0
2018-09-01 A     2.0
           B     7.0
2018-10-01 C     3.0
           A     6.0
           B     8.0
2018-11-01 A     2.0
2018-12-01 B     7.0
           C     9.0

I.2. Model (very slow in a large dataframe)

# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    # pairwise differences via broadcasting on the underlying array
    tmp = pd.DataFrame(x.values[:, None] - x.values[None, :],
                       columns=teams.values, index=teams.values).stack()
    # drop self-pairs (A vs A, B vs B, ...)
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]
new_df = df.groupby('date').xx.apply(get_diff).to_frame()

# group by pairs of players (group_keys=False so apply keeps the original index)
groups = new_df.groupby(level=[1, 2], group_keys=False)

# cumulative sum of past differences, shifted so the current competition is excluded
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values (Series.sum(level=...) is deprecated in recent pandas,
# so group by the index levels explicitly)
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).groupby(level=[0, 1]).sum()

Below you can see what the output of the model looks like:

                  xx  win_comp_past_difs
date       type
2018-01-01 A     1.0                 0.0
           B     5.0                 0.0
2018-02-01 B     3.0                 NaN
2018-03-01 A     2.0                -4.0
           B     7.0                 4.0
           C     3.0                 0.0
           D     1.0                 0.0
           E     6.0                 0.0
2018-05-01 B     3.0                 NaN
2018-06-01 A     5.0               -10.0
           B     2.0                13.0
           C     3.0                -3.0
2018-07-01 A     1.0                 NaN
2018-08-01 B     9.0                 3.0
           C     3.0                -3.0
2018-09-01 A     2.0                -6.0
           B     7.0                 6.0
2018-10-01 C     3.0               -10.0
           A     6.0               -10.0
           B     8.0                20.0
2018-11-01 A     2.0                 NaN
2018-12-01 B     7.0                14.0
           C     9.0               -14.0
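
As a sanity check on a single cell: on 2018-03-01, type A has previously met only B (in Jan-18, 1.0 - 5.0 = -4.0), so its win_comp_past_difs is -4.0. The lookup below is just an illustrative verification:

# A's only prior opponent among the Mar-18 entrants is B (Jan-18): 1.0 - 5.0 = -4.0
print(df.loc[(pd.Timestamp('2018-03-01'), 'A'), 'win_comp_past_difs'])  # -4.0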

Just in case it is difficult to understand what the user-defined function (get_diff) does, let me explain it below.

For this purpose I will work with one group of the groupby of the dataframe.

## II. EXPLANATION OF THE USER-DEFINED FUNCTION ##

To show how the user-defined function works, let me select a specific group of the groupby.

II.1 Choosing a specific group

gb = df.groupby('date')
gb2 = gb.get_group(list(gb.groups)[2])  # the third group: 2018-03-01

Which looks like this:

                    xx
  date       type
  2018-03-01 A     2.0
             B     7.0
             C     3.0
             D     1.0
             E     6.0

II.2 Creating a list of competitors (teams)

teams = gb2.index.get_level_values(1)

II.3 Creating a dataframe of the difference of 'xx' between 'type'

df_comp = pd.DataFrame(gb2.xx.values[:, None] - gb2.xx.values[None, :],
                       columns=teams.values, index=teams.values)

Which looks like this:

     A    B    C    D    E
A  0.0 -5.0 -1.0  1.0 -4.0
B  5.0  0.0  4.0  6.0  1.0
C  1.0 -4.0  0.0  2.0 -3.0
D -1.0 -6.0 -2.0  0.0 -5.0
E  4.0 -1.0  3.0  5.0  0.0

From this point I use the stack() function as an intermediate step to go back to the shape of the original dataframe (see the illustration below). The rest you can follow in section I. DATA AND MODEL.
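
For instance, stacking df_comp and dropping the self-pairs gives exactly what get_diff returns for this group:

# long-form series indexed by (type_i, type_j)
stacked = df_comp.stack()
# keep only pairs of distinct types (drop the zero diagonal)
pairs = stacked[stacked.index.get_level_values(0) != stacked.index.get_level_values(1)]
print(pairs.head())
# A  B   -5.0
#    C   -1.0
#    D    1.0
#    E   -4.0
# B  A    5.0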

If you could suggest how to make this code more efficient and faster to execute, I would really appreciate it.

asked Feb 12 '20 by Mario Arend



3 Answers

I only modify get_diff. The main points are moving stack outside of get_diff and taking advantage of the fact that stack drops NaN by default, which avoids the filtering inside get_diff.

The new get_diff_s uses np.fill_diagonal to set all diagonal values to NaN and returns a dataframe instead of a filtered series.

def get_diff_s(x):
    teams = x.index.get_level_values(1)
    # pairwise differences as a plain 2-D array
    arr = x.values[:, None] - x.values[None, :]
    np.fill_diagonal(arr, np.nan)
    return pd.DataFrame(arr, columns=teams.values, index=teams.values)

df['win_comp_past_difs'] = (df.groupby('date').xx.apply(get_diff_s)
                              .groupby(level=1).cumsum().stack()
                              .groupby(level=[1, 2]).shift()
                              .groupby(level=[0, 1]).sum())

Out[1348]:
                  xx  win_comp_past_difs
date       type
2018-01-01 A     1.0                 0.0
           B     5.0                 0.0
2018-02-01 B     3.0                 NaN
2018-03-01 A     2.0                -4.0
           B     7.0                 4.0
           C     3.0                 0.0
           D     1.0                 0.0
           E     6.0                 0.0
2018-05-01 B     3.0                 NaN
2018-06-01 A     5.0               -10.0
           B     2.0                13.0
           C     3.0                -3.0
2018-07-01 A     1.0                 NaN
2018-08-01 B     9.0                 3.0
           C     3.0                -3.0
2018-09-01 A     2.0                -6.0
           B     7.0                 6.0
2018-10-01 C     3.0               -10.0
           A     6.0               -10.0
           B     8.0                20.0
2018-11-01 A     2.0                 NaN
2018-12-01 B     7.0                14.0
           C     9.0               -14.0

Timing:

Original solution (I chained all of your commands into a one-liner):

In [1352]: %timeit df.groupby('date').xx.apply(get_diff).groupby(level=[1,2]).a
      ...: pply(lambda x: x.cumsum().shift()).sum(level=[0,1])
82.9 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Revised solution:

In [1353]: %timeit df.groupby('date').xx.apply(get_diff_s).groupby(level=1).cum
      ...: sum().stack().groupby(level=[1,2]).shift().sum(level=[0,1])
47.1 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, on the sample data, it's about 40% faster. However, I don't know how it performs on your real dataset.
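
The key trick is that stack() drops NaN by default, so NaN-ing the diagonal removes the self-pairs without any explicit filtering. A minimal illustration:

m = pd.DataFrame([[np.nan, -4.0],
                  [4.0, np.nan]], index=['A', 'B'], columns=['A', 'B'])
print(m.stack())
# A  B   -4.0
# B  A    4.0
# dtype: float64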

answered Oct 15 '22 by Andy L.


There is huge overhead in your many layers of indexes.

The best way to tackle this, in my opinion, is to parallelize the processing of each groupby group across different threads or processes. There are several threads on that here on SO that might be helpful.

As an alternative, you may reduce your indexing overhead by managing the indexes yourself:

f, s, t, d = [], [], [], []

for _, sub in df.groupby('date').xx:
    date = sub.index.get_level_values(0)
    i    = sub.index.get_level_values(1)
    # flattened matrix of pairwise score differences for this date
    tmp  = (sub.values[:, None] - sub.values).ravel()

    f.extend(np.repeat(i, len(i)))      # first team of each pair
    s.extend(np.tile(i, len(i)))        # second team of each pair
    t.extend(tmp)                       # score difference of the pair
    d.extend(np.repeat(date, len(i)))   # date of each pair

Then filter and do your cumsum+sum stuff.

inter = pd.DataFrame({'i0': d, 'i1': f, 'i2': s, 'i3': t}).query('i1 != i2')
# cumulative sum of past differences per pair, shifted to exclude the current game
v = inter.groupby(['i1', 'i2'], group_keys=False).i3.apply(lambda x: x.cumsum().shift())
df['rf'] = inter.assign(v=v).set_index(['i0', 'i1']).v.groupby(level=[0, 1]).sum()

The second block should run really quickly even for huge data frames. The heavy processing is in the groupby, which is why a map-reduce/multiprocessing approach could be super helpful.

The enhancement from manual index handling in this case is around 5x:

1 loop, best of 3: 3.5 s per loop
1 loop, best of 3: 738 ms per loop

The idea is to give you some directions on where to improve. The operations are independent, so it should be feasible to execute each iteration in a different thread or process (see the sketch below). You can also consider numba.
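
A minimal sketch of that direction, assuming a module-level worker function so the groups can be pickled (process_date and parallel_diffs are illustrative names, not part of the answer above):

from concurrent.futures import ProcessPoolExecutor

def process_date(args):
    # worker: flattened pairwise score differences for one date
    date, values, types = args
    return date, types, (values[:, None] - values).ravel()

def parallel_diffs(df, max_workers=4):
    # materialize the groups as plain, picklable tuples up front
    tasks = [(date, sub.values, sub.index.get_level_values(1).values)
             for date, sub in df.groupby('date').xx]
    with ProcessPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(process_date, tasks))

On platforms that spawn worker processes (Windows, macOS) the call must live under an if __name__ == '__main__': guard.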

answered Oct 15 '22 by rafaelc


I have formulated the problem as I understand it and would like to suggest a slightly different approach that uses only built-ins. There are two variations; the second one uses half the memory and runs in about half the time:

timeit -r10 event_score6(games, scores)                        
21.3 µs ± 165 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)

timeit -r10 event_score(events, games, scores)                 
42.8 µs ± 210 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)
#
# Assume game data comes from a csv-file that contains reasonably clean data.
#
# We have a list of games each with a list of participating teams and the
# scores for each team in the game.
#
# For each of the pairs in the current game first calculate the sum of the
# differences in score from the previous competitions (win_comp_past_difs);
# include only the pairs in the current game.  Second update each pair in the
# current game with the difference in scores.
#
# Using a defaultdict keep track of the scores for each pair in each game and
# update this score as each game is played.
#
import csv
from collections import defaultdict
from itertools import groupby
from itertools import permutations
from itertools import combinations
from math import nan as NaN


def read_data(data_file):
    """Read and group games and scores by event date

    Sort the participants in each game. Returns header, events, games,
    scores.
    """
    header = ""
    events = []
    games = []
    scores = []
    with open(data_file, newline='') as fd:
        sample = fd.read(1024)
        dialect = csv.Sniffer().sniff(sample)
        fd.seek(0)
        reader = csv.reader(fd, dialect)
        if csv.Sniffer().has_header(sample):
            header = next(reader)
        for event_date, row in groupby(reader, key=lambda r: r[0]):
            _, gg, ss = tuple(zip(*row))
            events.append(event_date.strip())
            gms = (tuple(g.strip() for g in gg))
            scr = (tuple(float(s) for s in ss))
            g, s = zip(*sorted(zip(gms, scr)))
            games.append(g)
            scores.append(s)
    return header, events, games, scores
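
# For the data2.csv listed at the end of this answer, read_data returns,
# illustratively (first three events only):
#   header     -> ['Event', 'Team', 'Score']
#   events[:3] -> ['Jan-18', 'Feb-18', 'Mar-18']
#   games[:3]  -> [('A', 'B'), ('B',), ('A', 'B', 'C', 'D', 'E')]
#   scores[:3] -> [(1.0, 5.0), (3.0,), (2.0, 7.0, 3.0, 1.0, 6.0)]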


def event_score(events, games, scores, wd=defaultdict(float)):
    """Score each event and calculare win_comp_past_difs iteratively

    Return the accumulated state from all events and the
    win_comp_past_difs grouped by event.
    """
    wins = []
    for evnt, game, xx in zip(events, games, scores):
        evnt_wins = []
        if len(game) == 1:
            win_comp_past_difs = NaN
            evnt_wins.append(win_comp_past_difs)
            wins.append(evnt_wins)
            continue

        # Pairs and difference generator for current game.
        pairs = list(permutations(game, 2))
        dgen = (value[0] - value[1] for value in permutations(xx, 2))

        # Sum of differences from previous games, including only pairs of
        # teams in the current game.
        for team, result in zip(game, xx):
            win_comp_past_difs = sum(wd[key]
                                     for key in pairs if key[0] == team)
            evnt_wins.append(win_comp_past_difs)
        wins.append(evnt_wins)

        # Update pair differences for the current game.
        for pair, diff in zip(pairs, dgen):
            wd[pair] += diff
    return wd, wins


def event_score6(games, scores, wd=defaultdict(float)):
    """Score each game and calculare win_comp_past_difs iteratively

    Assume sorted order in each game. Return the accumulated state from
    all events and the win_comp_past_difs grouped by event.
    """
    wins = []
    for game, xx in zip(games, scores):
        if len(game) == 1:
            wins.append((NaN,))
            continue

        # Pairs for current game.
        pairs = tuple(combinations(game, 2))

        # Sum of differences from previous games, including
        # only pairs of teams in the current game.
        win_comp_past_difs = defaultdict(float)
        for pair in pairs:
            tmp = wd[pair]
            win_comp_past_difs[pair[0]] += tmp
            win_comp_past_difs[pair[1]] -= tmp
        wins.append(tuple(win_comp_past_difs.values()))

        # Update pair differences for the current game.
        for pair, value in zip(pairs, combinations(xx, 2)):
            wd[pair] += value[0] - value[1]
    return wd, wins


h, events, games, scores = read_data('data2.csv')

wd, wins = event_score(events, games, scores)
wd6, wins6 = event_score6(games, scores)

print(h)
print("Elements ", len(wd))
for evnt, gm, sc, wns in zip(events, games, scores, wins):
    for team, result, win_comp_past_difs in zip(gm, sc, wns):
        print(f"{evnt} {team}: {result}\t{win_comp_past_difs: 5.1f}")

print(h)
print("Elements ", len(wd6))
for evnt, gm, sc, wns in zip(events, games, scores, wins6):
    for team, result, win_comp_past_difs in zip(gm, sc, wns):
        print(f"{evnt} {team}: {result}\t{win_comp_past_difs: 5.1f}")

A run of the code gives:

['Event', 'Team', 'Score']
Elements  20
Jan-18 A: 1.0     0.0
Jan-18 B: 5.0     0.0
Feb-18 B: 3.0     nan
Mar-18 A: 2.0    -4.0
Mar-18 B: 7.0     4.0
Mar-18 C: 3.0     0.0
Mar-18 D: 1.0     0.0
Mar-18 E: 6.0     0.0
May-18 B: 3.0     nan
Jun-18 A: 5.0   -10.0
Jun-18 B: 2.0    13.0
Jun-18 C: 3.0    -3.0
Jul-18 A: 1.0     nan
Aug-18 B: 9.0     3.0
Aug-18 C: 3.0    -3.0
Sep-18 A: 2.0    -6.0
Sep-18 B: 7.0     6.0
Oct-18 A: 6.0   -10.0
Oct-18 B: 8.0    20.0
Oct-18 C: 3.0   -10.0
Nov-18 A: 2.0     nan
Dec-18 B: 7.0    14.0
Dec-18 C: 9.0   -14.0
['Event', 'Team', 'Score']
Elements  10
Jan-18 A: 1.0     0.0
Jan-18 B: 5.0     0.0
Feb-18 B: 3.0     nan
Mar-18 A: 2.0    -4.0
Mar-18 B: 7.0     4.0
Mar-18 C: 3.0     0.0
Mar-18 D: 1.0     0.0
Mar-18 E: 6.0     0.0
May-18 B: 3.0     nan
Jun-18 A: 5.0   -10.0
Jun-18 B: 2.0    13.0
Jun-18 C: 3.0    -3.0
Jul-18 A: 1.0     nan
Aug-18 B: 9.0     3.0
Aug-18 C: 3.0    -3.0
Sep-18 A: 2.0    -6.0
Sep-18 B: 7.0     6.0
Oct-18 A: 6.0   -10.0
Oct-18 B: 8.0    20.0
Oct-18 C: 3.0   -10.0
Nov-18 A: 2.0     nan
Dec-18 B: 7.0    14.0
Dec-18 C: 9.0   -14.0

Using the file data2.csv:

Event, Team, Score
Jan-18, A, 1
Jan-18, B, 5
Feb-18, B, 3
Mar-18, A, 2
Mar-18, B, 7
Mar-18, C, 3
Mar-18, D, 1
Mar-18, E, 6
May-18, B, 3
Jun-18, A, 5
Jun-18, B, 2
Jun-18, C, 3
Jul-18, A, 1
Aug-18, B, 9
Aug-18, C, 3
Sep-18, A, 2
Sep-18, B, 7
Oct-18, C, 3
Oct-18, A, 6
Oct-18, B, 8
Nov-18, A, 2
Dec-18, B, 7
Dec-18, C, 9
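
If you want to try event_score6 without a CSV file, a minimal inline call might look like this (data mirroring the first three events above; passing a fresh defaultdict avoids sharing the mutable default argument between calls):

games  = [('A', 'B'), ('B',), ('A', 'B', 'C', 'D', 'E')]
scores = [(1.0, 5.0), (3.0,), (2.0, 7.0, 3.0, 1.0, 6.0)]
wd6, wins6 = event_score6(games, scores, wd=defaultdict(float))
print(wins6)  # [(0.0, 0.0), (nan,), (-4.0, 4.0, 0.0, 0.0, 0.0)]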
answered Oct 15 '22 by FredrikHedman