I have a DataFrame that was created by a groupby:
agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount': np.sum,
    'ID': pd.Series.unique,
})
After applying some filtering on agg_df, I want to aggregate again and concatenate the IDs:
agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is not in the groupby now
    'amount': np.sum,
    'ID': pd.Series.unique,
})
But the second 'ID': pd.Series.unique raises:
ValueError: Function does not reduce
As an example, the dataframe before the second groupby is:
               | amount |  ID   |
----+----+----+--------+-------+
 X  | Y  | Z  |        |       |
----+----+----+--------+-------+
 a1 | b1 | c1 |   10   |   2   |
    |    | c2 |   11   |   1   |
 a3 | b2 | c3 |    2   | [5,7] |
    |    | c4 |    7   |   3   |
 a5 | b3 | c3 |   12   | [6,3] |
    |    | c5 |   17   | [3,4] |
 a7 | b4 | c6 |    2   | [8,9] |
And the expected outcome should be
          | amount |    ID    |
----+----+--------+----------+
 X  | Y  |        |          |
----+----+--------+----------+
 a1 | b1 |   21   | [2,1]    |
 a3 | b2 |    9   | [5,7,3]  |
 a5 | b3 |   29   | [6,3,4]  |
 a7 | b4 |    2   | [8,9]    |
The order of the final IDs is not important.
Edit: I have come up with one solution, but it's not quite elegant:
import collections.abc

import numpy as np

def combine_ids(x):
    def asarray(elem):
        # Wrap iterable cells in an array; leave scalars as-is
        if isinstance(elem, collections.abc.Iterable):
            return np.asarray(list(elem))
        return elem
    # dtype=object keeps the ragged rows (required on recent NumPy)
    res = np.array([asarray(elem) for elem in x.values], dtype=object)
    res = np.unique(np.hstack(res))
    return set(res)
agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is not in the groupby now
    'amount': np.sum,
    'ID': combine_ids,
})
Edit2: Another solution which works in my case is:
combine_ids = lambda x: set(np.hstack(x.values))
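As a self-contained sketch of this (a toy frame mirroring the state before the second groupby, where the ID cells are a mix of arrays and scalars):

```python
import numpy as np
import pandas as pd

# np.hstack flattens the mix of scalar and array cells per group
combine_ids = lambda x: set(np.hstack(x.values))

before = pd.DataFrame({
    'X': ['a3', 'a3', 'a5', 'a5'],
    'Y': ['b2', 'b2', 'b3', 'b3'],
    'amount': [2, 7, 12, 17],
    'ID': [np.array([5, 7]), 3, np.array([6, 3]), np.array([3, 4])],
})

after = before.groupby(['X', 'Y']).agg({
    'amount': 'sum',
    'ID': combine_ids,
})
# after.loc[('a3', 'b2'), 'ID'] == {3, 5, 7}
```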
Edit3:
It seems that it is not possible to avoid set() as the resulting value, due to the implementation of Pandas' aggregation functions. Details in https://stackoverflow.com/a/16975602/3142459
If you're fine using sets as your type (which I probably would), then I would go with:
agg_df = df.groupby(['x', 'y', 'z']).agg({
    'amount': np.sum, 'id': lambda s: set(s)})
agg_df.reset_index().groupby(['x', 'y']).agg({
    'amount': np.sum, 'id': lambda s: set.union(*s)})
...which works for me. For some reason, lambda s: set(s) works but the bare builtin set doesn't (I'm guessing pandas isn't doing duck-typing correctly somewhere).
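A self-contained run of the two-step version above (toy data, using the lowercase column names from this answer):

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a1', 'a1', 'a3', 'a3', 'a3'],
    'y': ['b1', 'b1', 'b2', 'b2', 'b2'],
    'z': ['c1', 'c2', 'c3', 'c3', 'c4'],
    'amount': [10, 11, 1, 1, 7],
    'id': [2, 1, 5, 7, 3],
})

# First pass: collect ids per (x, y, z) as sets
agg_df = df.groupby(['x', 'y', 'z']).agg(
    {'amount': 'sum', 'id': lambda s: set(s)})

# Second pass: drop z and union the sets
result = agg_df.reset_index().groupby(['x', 'y']).agg(
    {'amount': 'sum', 'id': lambda s: set.union(*s)})
# result.loc[('a1', 'b1'), 'id'] == {1, 2}
# result.loc[('a3', 'b2'), 'id'] == {3, 5, 7}
```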
If your data is large, you'll probably want the following instead of lambda s: set.union(*s):
from functools import reduce
# can't partial b/c args are positional-only
def cheaper_set_union(s):
return reduce(set.union, s, set())
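Plugged into the aggregation it replaces the lambda (a self-contained toy sketch; the incremental reduce avoids unpacking every group's sets into one giant call):

```python
from functools import reduce

import pandas as pd

def cheaper_set_union(s):
    # Fold the sets one at a time instead of set.union(*s)
    return reduce(set.union, s, set())

# Toy stand-in for the intermediate frame with set-valued 'id' cells
agg_df = pd.DataFrame(
    {'amount': [2, 7], 'id': [{5, 7}, {3}]},
    index=pd.MultiIndex.from_tuples(
        [('a3', 'b2', 'c3'), ('a3', 'b2', 'c4')], names=['x', 'y', 'z']),
)

result = agg_df.reset_index().groupby(['x', 'y']).agg(
    {'amount': 'sum', 'id': cheaper_set_union})
# result.loc[('a3', 'b2'), 'id'] == {3, 5, 7}
```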