
Pandas nested groupby gives unexpected results

I am working on a problem where I use a nested groupby.apply on a pandas DataFrame. During the first apply I add a column that I then use for the second, inner groupby.apply. The combined result looks faulty to me. Can anyone explain why the phenomenon below happens and how to reliably fix it?

Here is a minimal example:

import numpy as np
import pandas as pd

T = np.array( [
        [1,1,1],
        [1,1,1],
        [1,2,2],
        [1,2,2],
        [2,1,3],
        [2,1,3],
        [2,2,4],
        [2,2,4],
])

df = pd.DataFrame(T, columns=['a', 'b', 'c'])

print(df)


def foo2(x):
    return x

def foo(x):

    print("*" * 80 )

    # Add column d and groupby/apply on column 'd'
    x['d'] = [1, 1, 2, 2]
    x = x.groupby('d').apply(foo2)

    print(x)

    print("*" * 80)
    return x


# Apply first groupby/apply on column 'a'
df = df.groupby('a').apply(foo)

print("*"*80)
print("*"*80)

print(df)

When I run the above code on my Windows laptop I get the expected result:

     a  b  c  d
a              
1 0  1  1  1  1
  1  1  1  1  1
  2  1  2  2  2
  3  1  2  2  2
2 4  2  1  3  1
  5  2  1  3  1
  6  2  2  4  2
  7  2  2  4  2

Running the same code on a Mac gives:

     a  b  c  d
a              
1 0  1  1  1  1
  1  1  1  1  1
  2  1  2  2  2
  3  1  2  2  2
2 4  1  1  3  1
  5  1  1  3  1
  6  1  2  4  2
  7  1  2  4  2

The issue is that on the Mac the last 4 entries in column 'a' are 1, while they should be 2, as on the Windows machine.

EDIT:

Pandas version on both: 0.24.2

Python version on Windows: 3.7.3

Python version on Mac: 3.7.4
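
The fault described above (column 'a' no longer agreeing with the outer index level 'a') can be checked programmatically rather than by eye. The following is only a sketch: key_matches_column is a hypothetical helper name, and the two frames are hand-built copies of the printed results (reduced to columns 'a' and 'd'), not output captured from either machine.

```python
import pandas as pd

def key_matches_column(result, key='a'):
    """Return True if every row's outer index label equals its value in `key`."""
    outer = result.index.get_level_values(0)
    return bool((outer == result[key].to_numpy()).all())

# Hand-built copies of the two printed results (only columns 'a' and 'd').
index = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 1, 2, 2, 2, 2], list(range(8))], names=['a', None])
d = [1, 1, 2, 2, 1, 1, 2, 2]
windows = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2], 'd': d}, index=index)
mac = pd.DataFrame({'a': [1] * 8, 'd': d}, index=index)

print(key_matches_column(windows))  # True
print(key_matches_column(mac))      # False
```

Running such a check right after the outer apply makes the discrepancy fail loudly instead of silently propagating.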

CookieMaster asked Nov 07 '22 15:11



1 Answer

[Mac, Python: 3.6.8]

My thinking is that the behaviour of nested DataFrame.apply calls is going to be a little convoluted to debug. My recommendation is to cut to the chase by emulating what you want to achieve from apply (i.e. map, then reduce):

  1. Map: Use Python's built-in map function, followed by
  2. Reduce: Use pandas.concat to combine the results

import numpy as np
import pandas as pd

def my_apply(df, f):
    return pd.concat(map(f, df))

def foo(x):
    group, grouped = x
    grouped['d'] = [1, 1, 2, 2]
    return grouped.groupby('d').apply(lambda x: x)

T = np.array([[1,1,1]]*2 + [[1,2,2]]*2 +
             [[2,1,3]]*2 + [[2,2,4]]*2)
df = pd.DataFrame(T, columns=['a', 'b', 'c'])
df = my_apply(df.groupby('a'), foo)
print(df)

Result:

   a  b  c  d
0  1  1  1  1
1  1  1  1  1
2  1  2  2  2
3  1  2  2  2
4  2  1  3  1
5  2  1  3  1
6  2  2  4  2
7  2  2  4  2
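
If column 'd' only labels consecutive pairs of rows within each 'a' group, as the hard-coded [1, 1, 2, 2] in the example suggests, the nested apply may not be needed at all. The sketch below rests on that assumption (it is not something the question states) and derives 'd' from each row's position within its group using GroupBy.cumcount:

```python
import numpy as np
import pandas as pd

T = np.array([[1, 1, 1]] * 2 + [[1, 2, 2]] * 2 +
             [[2, 1, 3]] * 2 + [[2, 2, 4]] * 2)
df = pd.DataFrame(T, columns=['a', 'b', 'c'])

# cumcount() numbers the rows 0, 1, 2, ... within each 'a' group;
# integer-dividing by 2 and adding 1 turns 0, 1, 2, 3 into 1, 1, 2, 2.
df['d'] = df.groupby('a').cumcount() // 2 + 1

print(df)
```

Because no user function receives a group here, nothing can mutate a group in place, and the original index is kept as-is.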

Notes:

  1. I have not tried to address the difference in implementation/architecture leading to this difference in behaviour between Mac and Windows.
  2. I've minified your example and replaced foo2 with a lambda; feel free to swap back.
  3. The above code will raise the following warning (pandas' SettingWithCopyWarning): A value is trying to be set on a copy of a slice from a DataFrame [...]. This is because we're deliberately setting a value on a copy. This is expected behaviour, not a bug. Unfortunately pandas interprets this operation as a mistake, since it usually is one.
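
One way to sidestep the warning from note 3 is to take an explicit copy of each group before adding the column. The pandas documentation also cautions that functions passed to apply should not mutate the object they receive, so copying first is a reasonable habit in general; whether in-place mutation explains the Mac/Windows difference in the question is not established here. A minimal sketch, where add_d is a hypothetical name and the inner groupby is left out for brevity:

```python
import numpy as np
import pandas as pd

def add_d(group):
    # Work on an explicit copy so the assignment never touches
    # a slice of the original frame.
    group = group.copy()
    group['d'] = [1, 1, 2, 2]
    return group

T = np.array([[1, 1, 1]] * 2 + [[1, 2, 2]] * 2 +
             [[2, 1, 3]] * 2 + [[2, 2, 4]] * 2)
df = pd.DataFrame(T, columns=['a', 'b', 'c'])

# Iterating a GroupBy yields (key, sub-frame) pairs: map each
# sub-frame through add_d, then reduce with pd.concat.
result = pd.concat([add_d(g) for _, g in df.groupby('a')])

print(result)
```

The original df is left untouched, so the step can be re-run safely.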
kd88 answered Nov 17 '22 00:11
