Pandas / Python - Very slow performance using stack() groupby() and apply()

I am trying to create a new column in a dataframe based on pairs of information and their previous values. Although the code I run is correct and gives the results I need, it is very slow on a large dataframe, so I suspect I am not using the full power of Python for this task. Is there a more efficient and faster way of doing this in Python?

To put you in context, let me explain to you a little about what I am looking for:

I have a dataframe that describes competition results: for each 'date' you can see each 'type' that competed and its score, called 'xx'.

What my code does is compute, for each 'date', the difference in score 'xx' between every pair of 'type' values, and then, for each type, the sum of the score differences from all of its previous competitions against the types it meets on that date ('win_comp_past_difs').

Below you can see the data and the model with its output.

## I. DATA AND MODEL ##

I.1. Data

import pandas as pd
import numpy as np

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y') 
df=df.set_index(['date','type'])
df['xx'] = df.xx.astype('float')

Which looks like this:

                  xx
date       type
2018-01-01 A     1.0
           B     5.0
2018-02-01 B     3.0
2018-03-01 A     2.0
           B     7.0
           C     3.0
           D     1.0
           E     6.0
2018-05-01 B     3.0
2018-06-01 A     5.0
           B     2.0
           C     3.0
2018-07-01 A     1.0
2018-08-01 B     9.0
           C     3.0
2018-09-01 A     2.0
           B     7.0
2018-10-01 C     3.0
           A     6.0
           B     8.0
2018-11-01 A     2.0
2018-12-01 B     7.0
           C     9.0

I.2. Model (very slow in a large dataframe)

# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    # pairwise differences via broadcasting on the underlying array
    tmp = pd.DataFrame(x.values[:, None] - x.values[None, :],
                       columns=teams.values, index=teams.values).stack()
    # drop self-pairs (A vs A, B vs B, ...)
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]
new_df = df.groupby('date').xx.apply(get_diff).to_frame()

# group by pairs of players (group_keys=False so apply keeps the original index)
groups = new_df.groupby(level=[1, 2], group_keys=False)

# cumulative sum of past differences, shifted so the current competition is excluded
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values (Series.sum(level=...) is deprecated in recent pandas,
# so group by the index levels explicitly)
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).groupby(level=[0, 1]).sum()

Below you can see what the output of the model looks like:

                  xx  win_comp_past_difs
date       type
2018-01-01 A     1.0                 0.0
           B     5.0                 0.0
2018-02-01 B     3.0                 NaN
2018-03-01 A     2.0                -4.0
           B     7.0                 4.0
           C     3.0                 0.0
           D     1.0                 0.0
           E     6.0                 0.0
2018-05-01 B     3.0                 NaN
2018-06-01 A     5.0               -10.0
           B     2.0                13.0
           C     3.0                -3.0
2018-07-01 A     1.0                 NaN
2018-08-01 B     9.0                 3.0
           C     3.0                -3.0
2018-09-01 A     2.0                -6.0
           B     7.0                 6.0
2018-10-01 C     3.0               -10.0
           A     6.0               -10.0
           B     8.0                20.0
2018-11-01 A     2.0                 NaN
2018-12-01 B     7.0                14.0
           C     9.0               -14.0
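
As a sanity check on a single cell: on 2018-03-01, type A has previously met only B (in Jan-18, 1.0 - 5.0 = -4.0), so its win_comp_past_difs is -4.0. The lookup below is just an illustrative verification:

# A's only prior opponent among the Mar-18 entrants is B (Jan-18): 1.0 - 5.0 = -4.0
print(df.loc[(pd.Timestamp('2018-03-01'), 'A'), 'win_comp_past_difs'])  # -4.0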

Just in case it is difficult to understand what the user-defined function (get_diff) does, let me explain it below.

For this purpose I will work with one group of the groupby of the dataframe.

## II. EXPLANATION OF THE USER-DEFINED FUNCTION ##

To show how the user-defined function works, let me select a specific group of the groupby.

II.1 Choosing a specific group

gb = df.groupby('date')
gb2 = gb.get_group(list(gb.groups)[2])  # the third group: 2018-03-01

Which looks like this:

                    xx
  date       type
  2018-03-01 A     2.0
             B     7.0
             C     3.0
             D     1.0
             E     6.0

II.2 Creating a list of competitors (teams)

teams = gb2.index.get_level_values(1)

II.3 Creating a dataframe of the difference of 'xx' between 'type'

df_comp = pd.DataFrame(gb2.xx.values[:, None] - gb2.xx.values[None, :],
                       columns=teams.values, index=teams.values)

Which looks like this:

     A    B    C    D    E
A  0.0 -5.0 -1.0  1.0 -4.0
B  5.0  0.0  4.0  6.0  1.0
C  1.0 -4.0  0.0  2.0 -3.0
D -1.0 -6.0 -2.0  0.0 -5.0
E  4.0 -1.0  3.0  5.0  0.0

From this point I use the stack() function as an intermediate step to go back to the shape of the original dataframe (see the illustration below). The rest you can follow in section I. DATA AND MODEL.
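
For instance, stacking df_comp and dropping the self-pairs gives exactly what get_diff returns for this group:

# long-form series indexed by (type_i, type_j)
stacked = df_comp.stack()
# keep only pairs of distinct types (drop the zero diagonal)
pairs = stacked[stacked.index.get_level_values(0) != stacked.index.get_level_values(1)]
print(pairs.head())
# A  B   -5.0
#    C   -1.0
#    D    1.0
#    E   -4.0
# B  A    5.0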

If you could suggest how to make this code more efficient and faster to execute, I would really appreciate it.

asked Feb 12 '20 by Mario Arend



3 Answers

I only modify get_diff. The main points are moving stack outside of get_diff and taking advantage of the fact that stack drops NaN by default, which avoids the filtering inside get_diff.

The new get_diff_s uses np.fill_diagonal to set all diagonal values to NaN and returns a dataframe instead of a filtered series.

def get_diff_s(x):
    teams = x.index.get_level_values(1)
    # pairwise differences as a plain 2-D array
    arr = x.values[:, None] - x.values[None, :]
    np.fill_diagonal(arr, np.nan)
    return pd.DataFrame(arr, columns=teams.values, index=teams.values)

df['win_comp_past_difs'] = (df.groupby('date').xx.apply(get_diff_s)
                              .groupby(level=1).cumsum().stack()
                              .groupby(level=[1, 2]).shift()
                              .groupby(level=[0, 1]).sum())

Out[1348]:
                  xx  win_comp_past_difs
date       type
2018-01-01 A     1.0                 0.0
           B     5.0                 0.0
2018-02-01 B     3.0                 NaN
2018-03-01 A     2.0                -4.0
           B     7.0                 4.0
           C     3.0                 0.0
           D     1.0                 0.0
           E     6.0                 0.0
2018-05-01 B     3.0                 NaN
2018-06-01 A     5.0               -10.0
           B     2.0                13.0
           C     3.0                -3.0
2018-07-01 A     1.0                 NaN
2018-08-01 B     9.0                 3.0
           C     3.0                -3.0
2018-09-01 A     2.0                -6.0
           B     7.0                 6.0
2018-10-01 C     3.0               -10.0
           A     6.0               -10.0
           B     8.0                20.0
2018-11-01 A     2.0                 NaN
2018-12-01 B     7.0                14.0
           C     9.0               -14.0

Timing:

Original solution (I chained all of your commands into a one-liner):

In [1352]: %timeit df.groupby('date').xx.apply(get_diff).groupby(level=[1,2]).a
      ...: pply(lambda x: x.cumsum().shift()).sum(level=[0,1])
82.9 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Revised solution:

In [1353]: %timeit df.groupby('date').xx.apply(get_diff_s).groupby(level=1).cum
      ...: sum().stack().groupby(level=[1,2]).shift().sum(level=[0,1])
47.1 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, on the sample data, it's about 40% faster. However, I don't know how it performs on your real dataset.
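
The key trick is that stack() drops NaN by default, so NaN-ing the diagonal removes the self-pairs without any explicit filtering. A minimal illustration:

m = pd.DataFrame([[np.nan, -4.0],
                  [4.0, np.nan]], index=['A', 'B'], columns=['A', 'B'])
print(m.stack())
# A  B   -4.0
# B  A    4.0
# dtype: float64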

answered Oct 15 '22 by Andy L.


There is huge overhead in your many layers of indexes.

The best way to tackle this, in my opinion, is to parallelize the processing of each groupby group across different threads or processes. There are several threads on that here on SO that might be helpful.

As an alternative, you may reduce your indexing overhead by managing the indexes yourself:

f, s, t, d = [], [], [], []

for _, sub in df.groupby('date').xx:
    date = sub.index.get_level_values(0)
    i    = sub.index.get_level_values(1)
    # flattened matrix of pairwise score differences for this date
    tmp  = (sub.values[:, None] - sub.values).ravel()

    f.extend(np.repeat(i, len(i)))      # first team of each pair
    s.extend(np.tile(i, len(i)))        # second team of each pair
    t.extend(tmp)                       # score difference of the pair
    d.extend(np.repeat(date, len(i)))   # date of each pair

Then filter and do your cumsum+sum stuff.

inter = pd.DataFrame({'i0': d, 'i1': f, 'i2': s, 'i3': t}).query('i1 != i2')
# cumulative sum of past differences per pair, shifted to exclude the current game
v = inter.groupby(['i1', 'i2'], group_keys=False).i3.apply(lambda x: x.cumsum().shift())
df['rf'] = inter.assign(v=v).set_index(['i0', 'i1']).v.groupby(level=[0, 1]).sum()

The second block should run really quickly even for huge data frames. The heavy processing is in the groupby, which is why a map-reduce/multiprocessing approach could be super helpful.

The enhancement from manual index handling in this case is around 5x:

1 loop, best of 3: 3.5 s per loop
1 loop, best of 3: 738 ms per loop

The idea is to give you some directions on where to improve. The operations are independent, so it should be feasible to execute each iteration in a different thread or process (see the sketch below). You can also consider numba.
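
A minimal sketch of that direction, assuming a module-level worker function so the groups can be pickled (process_date and parallel_diffs are illustrative names, not part of the answer above):

from concurrent.futures import ProcessPoolExecutor

def process_date(args):
    # worker: flattened pairwise score differences for one date
    date, values, types = args
    return date, types, (values[:, None] - values).ravel()

def parallel_diffs(df, max_workers=4):
    # materialize the groups as plain, picklable tuples up front
    tasks = [(date, sub.values, sub.index.get_level_values(1).values)
             for date, sub in df.groupby('date').xx]
    with ProcessPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(process_date, tasks))

On platforms that spawn worker processes (Windows, macOS) the call must live under an if __name__ == '__main__': guard.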

answered Oct 15 '22 by rafaelc


I have formulated the problem as I understand it and would like to suggest a slightly different approach that uses only built-ins. There are two variations; the second one uses half the memory and runs in about half the time:

timeit -r10 event_score6(games, scores)                        
21.3 µs ± 165 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)

timeit -r10 event_score(events, games, scores)                 
42.8 µs ± 210 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)
#
# Assume game data comes from a csv-file that contains reasonably clean data.
#
# We have a list of games each with a list of participating teams and the
# scores for each team in the game.
#
# For each of the pairs in the current game first calculate the sum of the
# differences in score from the previous competitions (win_comp_past_difs);
# include only the pairs in the current game.  Second update each pair in the
# current game with the difference in scores.
#
# Using a defaultdict keep track of the scores for each pair in each game and
# update this score as each game is played.
#
import csv
from collections import defaultdict
from itertools import groupby
from itertools import permutations
from itertools import combinations
from math import nan as NaN


def read_data(data_file):
    """Read and group games and scores by event date

    Sort the participants in each game. Returns header, events, games,
    scores.
    """
    header = ""
    events = []
    games = []
    scores = []
    with open(data_file, newline='') as fd:
        sample = fd.read(1024)
        dialect = csv.Sniffer().sniff(sample)
        fd.seek(0)
        reader = csv.reader(fd, dialect)
        if csv.Sniffer().has_header(sample):
            header = next(reader)
        for event_date, row in groupby(reader, key=lambda r: r[0]):
            _, gg, ss = tuple(zip(*row))
            events.append(event_date.strip())
            gms = (tuple(g.strip() for g in gg))
            scr = (tuple(float(s) for s in ss))
            g, s = zip(*sorted(zip(gms, scr)))
            games.append(g)
            scores.append(s)
    return header, events, games, scores
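
# For the data2.csv listed at the end of this answer, read_data returns,
# illustratively (first three events only):
#   header     -> ['Event', 'Team', 'Score']
#   events[:3] -> ['Jan-18', 'Feb-18', 'Mar-18']
#   games[:3]  -> [('A', 'B'), ('B',), ('A', 'B', 'C', 'D', 'E')]
#   scores[:3] -> [(1.0, 5.0), (3.0,), (2.0, 7.0, 3.0, 1.0, 6.0)]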


def event_score(events, games, scores, wd=defaultdict(float)):
    """Score each event and calculare win_comp_past_difs iteratively

    Return the accumulated state from all events and the
    win_comp_past_difs grouped by event.
    """
    wins = []
    for evnt, game, xx in zip(events, games, scores):
        evnt_wins = []
        if len(game) == 1:
            win_comp_past_difs = NaN
            evnt_wins.append(win_comp_past_difs)
            wins.append(evnt_wins)
            continue

        # Pairs and difference generator for current game.
        pairs = list(permutations(game, 2))
        dgen = (value[0] - value[1] for value in permutations(xx, 2))

        # Sum of differences from previous games, including only pairs of
        # teams in the current game.
        for team, result in zip(game, xx):
            win_comp_past_difs = sum(wd[key]
                                     for key in pairs if key[0] == team)
            evnt_wins.append(win_comp_past_difs)
        wins.append(evnt_wins)

        # Update pair differences for the current game.
        for pair, diff in zip(pairs, dgen):
            wd[pair] += diff
    return wd, wins


def event_score6(games, scores, wd=defaultdict(float)):
    """Score each game and calculare win_comp_past_difs iteratively

    Assume sorted order in each game. Return the accumulated state from
    all events and the win_comp_past_difs grouped by event.
    """
    wins = []
    for game, xx in zip(games, scores):
        if len(game) == 1:
            wins.append((NaN,))
            continue

        # Pairs for current game.
        pairs = tuple(combinations(game, 2))

        # Sum of differences from previous games, including
        # only pairs of teams in the current game.
        win_comp_past_difs = defaultdict(float)
        for pair in pairs:
            tmp = wd[pair]
            win_comp_past_difs[pair[0]] += tmp
            win_comp_past_difs[pair[1]] -= tmp
        wins.append(tuple(win_comp_past_difs.values()))

        # Update pair differences for the current game.
        for pair, value in zip(pairs, combinations(xx, 2)):
            wd[pair] += value[0] - value[1]
    return wd, wins


h, events, games, scores = read_data('data2.csv')

wd, wins = event_score(events, games, scores)
wd6, wins6 = event_score6(games, scores)

print(h)
print("Elements ", len(wd))
for evnt, gm, sc, wns in zip(events, games, scores, wins):
    for team, result, win_comp_past_difs in zip(gm, sc, wns):
        print(f"{evnt} {team}: {result}\t{win_comp_past_difs: 5.1f}")

print(h)
print("Elements ", len(wd6))
for evnt, gm, sc, wns in zip(events, games, scores, wins6):
    for team, result, win_comp_past_difs in zip(gm, sc, wns):
        print(f"{evnt} {team}: {result}\t{win_comp_past_difs: 5.1f}")

A run of the code gives:

['Event', 'Team', 'Score']
Elements  20
Jan-18 A: 1.0     0.0
Jan-18 B: 5.0     0.0
Feb-18 B: 3.0     nan
Mar-18 A: 2.0    -4.0
Mar-18 B: 7.0     4.0
Mar-18 C: 3.0     0.0
Mar-18 D: 1.0     0.0
Mar-18 E: 6.0     0.0
May-18 B: 3.0     nan
Jun-18 A: 5.0   -10.0
Jun-18 B: 2.0    13.0
Jun-18 C: 3.0    -3.0
Jul-18 A: 1.0     nan
Aug-18 B: 9.0     3.0
Aug-18 C: 3.0    -3.0
Sep-18 A: 2.0    -6.0
Sep-18 B: 7.0     6.0
Oct-18 A: 6.0   -10.0
Oct-18 B: 8.0    20.0
Oct-18 C: 3.0   -10.0
Nov-18 A: 2.0     nan
Dec-18 B: 7.0    14.0
Dec-18 C: 9.0   -14.0
['Event', 'Team', 'Score']
Elements  10
Jan-18 A: 1.0     0.0
Jan-18 B: 5.0     0.0
Feb-18 B: 3.0     nan
Mar-18 A: 2.0    -4.0
Mar-18 B: 7.0     4.0
Mar-18 C: 3.0     0.0
Mar-18 D: 1.0     0.0
Mar-18 E: 6.0     0.0
May-18 B: 3.0     nan
Jun-18 A: 5.0   -10.0
Jun-18 B: 2.0    13.0
Jun-18 C: 3.0    -3.0
Jul-18 A: 1.0     nan
Aug-18 B: 9.0     3.0
Aug-18 C: 3.0    -3.0
Sep-18 A: 2.0    -6.0
Sep-18 B: 7.0     6.0
Oct-18 A: 6.0   -10.0
Oct-18 B: 8.0    20.0
Oct-18 C: 3.0   -10.0
Nov-18 A: 2.0     nan
Dec-18 B: 7.0    14.0
Dec-18 C: 9.0   -14.0

Using the file data2.csv:

Event, Team, Score
Jan-18, A, 1
Jan-18, B, 5
Feb-18, B, 3
Mar-18, A, 2
Mar-18, B, 7
Mar-18, C, 3
Mar-18, D, 1
Mar-18, E, 6
May-18, B, 3
Jun-18, A, 5
Jun-18, B, 2
Jun-18, C, 3
Jul-18, A, 1
Aug-18, B, 9
Aug-18, C, 3
Sep-18, A, 2
Sep-18, B, 7
Oct-18, C, 3
Oct-18, A, 6
Oct-18, B, 8
Nov-18, A, 2
Dec-18, B, 7
Dec-18, C, 9
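
If you want to try event_score6 without a CSV file, a minimal inline call might look like this (data mirroring the first three events above; passing a fresh defaultdict avoids sharing the mutable default argument between calls):

games  = [('A', 'B'), ('B',), ('A', 'B', 'C', 'D', 'E')]
scores = [(1.0, 5.0), (3.0,), (2.0, 7.0, 3.0, 1.0, 6.0)]
wd6, wins6 = event_score6(games, scores, wd=defaultdict(float))
print(wins6)  # [(0.0, 0.0), (nan,), (-4.0, 4.0, 0.0, 0.0, 0.0)]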
answered Oct 15 '22 by FredrikHedman