I have a sample snippet that works as expected:
import pandas as pd
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)
The result is:
label wave y new
0 a 1 0 (1,)
1 b 2 0 (2, 3)
2 b 3 0 (2, 3)
3 c 4 0 (4,)
It works analogously if, instead of tuple, I pass set, frozenset, or dict to transform, but if I pass list I get a completely unexpected result:
df['new'] = df.groupby(['label'])[['wave']].transform(list)
label wave y new
0 a 1 0 1
1 b 2 0 2
2 b 3 0 3
3 c 4 0 4
There is a workaround to get the expected result:
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
I thought about mutability/immutability (list vs. tuple), but for set/frozenset the behavior is consistent.
The question is: why does it work this way?
I've come across a similar issue before. I think the underlying issue is that when the number of elements in the list matches the number of records in the group, pandas tries to unpack the list so that each element maps to a record in the group.
For example, this will cause the list to unpack, because the length of the list matches the length of each group:
df.groupby(['label'])[['wave']].transform(lambda x: list(x))
wave
0 1
1 2
2 3
3 4
However, if the length of the list is not the same as that of each group, you get the desired behaviour:
df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])
wave
0 [1, 0]
1 [2, 3, 0]
2 [2, 3, 0]
3 [4, 0]
I think this is a side effect of the list unpacking functionality.
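To make the unpacking analogy concrete, here is a minimal sketch of the same alignment rule at plain column-assignment level (my own illustration, not pandas internals): a list whose length matches the frame is spread element-wise, while a tuple is treated as a single object per cell.
import pandas as pd

df = pd.DataFrame({'wave': [2, 3]})

# A list whose length matches the frame is aligned element-wise ...
df['unpacked'] = [10, 20]

# ... while a tuple placed in each row is kept as one object per cell.
df['kept'] = [(10, 20), (10, 20)]

print(df)
#    wave  unpacked      kept
# 0     2        10  (10, 20)
# 1     3        20  (10, 20)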
I think that is a bug in pandas. Can you open a ticket on their GitHub page, please?
At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:
def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)
I get the same unexpected result. If, however, the agg method is used, it works directly:
df.groupby(['label'])['wave'].agg(list)
Out[179]:
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
I can't imagine that this is intended behavior.
Btw, I also find the different behavior suspicious that shows up if you apply tuple to a grouped Series versus a grouped DataFrame. E.g. if transform(tuple) is applied to a Series instead of a DataFrame, the result is not a Series containing tuples either, but a Series containing ints (remember that for [['wave']], which creates a one-column DataFrame, transform(tuple) indeed returned tuples):
df.groupby(['label'])['wave'].transform(tuple)
Out[177]:
0 1
1 2
2 3
3 4
Name: wave, dtype: int64
If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].
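For reference, a quick check of that claim (my own snippet, using the same df as above):
import pandas as pd

df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0, 0, 0, 0]})

# Both the Series and the one-column DataFrame keep tuples under agg.
print(df.groupby(['label'])['wave'].agg(tuple))
print(df.groupby(['label'])[['wave']].agg(tuple))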
I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.
Since DataFrames are mainly designed to handle 2D data, storing arrays instead of scalar values can stumble upon a caveat such as this one. pd.DataFrame.transform is originally implemented on top of .agg:
# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce " "aggregated results")
    return result
However, transform always returns a DataFrame that must have the same length as self, which is essentially the input.
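For completeness, a small demonstration of how that length check behaves (my own sketch, not part of the original answer; the exact error message differs between pandas versions): a length-preserving function passes through, while an aggregating one raises the ValueError shown in the source above.
import pandas as pd

df = pd.DataFrame({'wave': [1, 2, 3, 4]})

# Length-preserving: the result has the same length as the input and is returned.
print(df.transform(lambda x: x + 1))

# Aggregating: the result is shorter than the input, so transform refuses it.
try:
    df.transform(sum)
except ValueError as exc:
    print(exc)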
When you apply an .agg function to the DataFrame, it works fine:
df.groupby('label')['wave'].agg(list)
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
The problem gets introduced when transform tries to return a Series with the same length.
In the process of transforming a groupby element, which is a slice of self, and then concatenating it again, lists get unpacked to the same length as the index, as @Allen mentioned. However, when the lengths don't align, the lists don't get unpacked:
df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
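A rough model of why the alignment happens (my own approximation of the mechanism, not the actual pandas code): transform applies the function group by group and rebuilds the result on each group's index, and building that per-group result from a list whose length matches the group aligns it element-wise.
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4]})

pieces = []
for _, group in df.groupby('label'):
    value = list(group['wave'])  # e.g. [2, 3] for label 'b'
    # Constructing the per-group result on the group's index spreads a list of
    # matching length element-wise, reproducing the surprising output above.
    pieces.append(pd.Series(value, index=group.index))

print(pd.concat(pieces))
# 0    1
# 1    2
# 2    3
# 3    4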
A workaround for this problem might be to avoid transform:
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
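Another workaround along the same lines (not from the original answer) is to aggregate once per label and map the lists back onto the original frame instead of merging:
import pandas as pd

df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0, 0, 0, 0]})

# Look up the per-label list for every row via the label column.
df['new'] = df['label'].map(df.groupby('label')['wave'].agg(list))
print(df)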