I am working on a problem where I use a nested groupby.apply on a pandas DataFrame. During the first apply I add a column that I then use for the second, inner groupby.apply. The combined result looks faulty to me. Can anyone explain why the phenomenon below happens and how to reliably fix it?
Here is a minimal example:
import numpy as np
import pandas as pd
T = np.array([
    [1,1,1],
    [1,1,1],
    [1,2,2],
    [1,2,2],
    [2,1,3],
    [2,1,3],
    [2,2,4],
    [2,2,4],
])
df = pd.DataFrame(T, columns= ['a','b','c' ])
print(df)
def foo2(x):
    return x
def foo(x):
    print("*" * 80 )
    # Add column d and groupby/apply on column 'd'
    x['d'] = [1, 1, 2, 2]
    x = x.groupby('d').apply(foo2)
    print(x)
    print("*" * 80)
    return x
# Apply first groupby/apply on column 'a'
df = df.groupby('a').apply( foo)
print("*"*80)
print("*"*80)
print(df)
When I run the above code on my Windows laptop I get the expected result
     a  b  c  d
a
1 0  1  1  1  1
  1  1  1  1  1
  2  1  2  2  2
  3  1  2  2  2
2 4  2  1  3  1
  5  2  1  3  1
  6  2  2  4  2
  7  2  2  4  2
Running the same code on a Mac gives
     a  b  c  d
a
1 0  1  1  1  1
  1  1  1  1  1
  2  1  2  2  2
  3  1  2  2  2
2 4  1  1  3  1
  5  1  1  3  1
  6  1  2  4  2
  7  1  2  4  2
The issue here is that in column 'a' the last 4 entries are 1 while they should be 2 as on the Windows machine.
EDIT:
Pandas version on both: 0.24.2
Python version on Windows: 3.7.3
Python version on Mac: 3.7.4
[Mac, Python: 3.6.8]
My thinking is that the expected behaviour of nested DataFrame.apply calls is going to be a little convoluted to debug. My recommendation is to cut to the chase by emulating what you want to achieve from apply (i.e. map then reduce):
- the built-in map function to call f on each group, followed by
- pandas.concat to combine the results
import numpy as np
import pandas as pd
def my_apply(df, f):
    return pd.concat(map(f, df))
def foo(x):
    group, grouped = x
    grouped['d'] = [1, 1, 2, 2]
    return grouped.groupby('d').apply(lambda x: x)
T = np.array([[1,1,1]]*2 + [[1,2,2]]*2 +
             [[2,1,3]]*2 + [[2,2,4]]*2)
df = pd.DataFrame(T, columns= ['a','b','c' ])
df = my_apply(df.groupby('a'), foo)
print(df)
Result:
   a  b  c  d
0  1  1  1  1
1  1  1  1  1
2  1  2  2  2
3  1  2  2  2
4  2  1  3  1
5  2  1  3  1
6  2  2  4  2
7  2  2  4  2
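To make the unpacking at the top of foo concrete: iterating over a groupby object yields one (key, sub-frame) pair per group, and that pair is what map hands to foo. A minimal sketch, using a small hypothetical frame called demo rather than the one from the question:

```python
import pandas as pd

# Small illustrative frame (hypothetical, not the data from the question).
demo = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [10, 20, 30, 40]})

# Iterating a GroupBy yields one (key, sub-frame) pair per group,
# in sorted key order by default.
pairs = [(int(key), sub['b'].tolist()) for key, sub in demo.groupby('a')]
print(pairs)
```

Each pair holds the value of the grouping column and the rows that belong to it, so foo can ignore the key and work on the sub-frame alone.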
Notes:
- I replaced foo2 with a lambda, feel free to swap back.
- You will see the warning: A value is trying to be set on a copy of a slice from a DataFrame [...]. This is because we're deliberately setting a value on a copy. This is expected behaviour, not a bug. Unfortunately pandas interprets this operation as a mistake, since it probably normally is.
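If the warning is a nuisance, one possible variant is to avoid writing into the group at all. The sketch below is an assumption on my part, not something tested on the setups from the question: it uses DataFrame.assign, which returns a new frame, and it also replaces the inner groupby.apply with the same map-then-concat idea so that no nested apply is left. The names my_apply and foo follow the code above; result is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

def my_apply(grouped, f):
    # Call f on every (key, sub-frame) pair and stack the results.
    return pd.concat([f(item) for item in grouped])

def foo(item):
    key, group = item
    # assign returns a new DataFrame, so the original group is not mutated
    # and no SettingWithCopyWarning is raised.
    group = group.assign(d=[1, 1, 2, 2])
    # Inner level: same pattern, concatenate the sub-groups of 'd'.
    return pd.concat([sub for _, sub in group.groupby('d')])

T = np.array([[1, 1, 1]] * 2 + [[1, 2, 2]] * 2 +
             [[2, 1, 3]] * 2 + [[2, 2, 4]] * 2)
df = pd.DataFrame(T, columns=['a', 'b', 'c'])

result = my_apply(df.groupby('a'), foo)
print(result)
```

Because nothing is written back into the groups, the original df keeps its three columns and column 'a' in result comes straight from the source rows.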