I have a DataFrame that was created by a groupby:
agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount': np.sum,
    'ID': pd.Series.unique,
})
After applying some filtering on agg_df, I want to aggregate again and concatenate the IDs:
agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is not in the groupby now
    'amount': np.sum,
    'ID': pd.Series.unique,
})
But the second 'ID': pd.Series.unique raises:
ValueError: Function does not reduce
As an example, the dataframe before the second groupby is:
               | amount |  ID   |
----+----+----+--------+-------+
 X  | Y  | Z  |        |       |
----+----+----+--------+-------+
 a1 | b1 | c1 |   10   |   2   |
    |    | c2 |   11   |   1   |
 a3 | b2 | c3 |    2   | [5,7] |
    |    | c4 |    7   |   3   |
 a5 | b3 | c3 |   12   | [6,3] |
    |    | c5 |   17   | [3,4] |
 a7 | b4 | c6 |    2   | [8,9] |
And the expected outcome should be
          | amount |    ID    |
----+----+--------+----------+
 X  | Y  |        |          |
----+----+--------+----------+
 a1 | b1 |   21   | [2,1]    |
 a3 | b2 |    9   | [5,7,3]  |
 a5 | b3 |   29   | [6,3,4]  |
 a7 | b4 |    2   | [8,9]    |
The order of the final IDs is not important.
Edit: I have come up with one solution, but it's not quite elegant:
import collections.abc

import numpy as np

def combine_ids(x):
    def asarray(elem):
        # Wrap iterable cells in an array; leave scalars as-is
        if isinstance(elem, collections.abc.Iterable):
            return np.asarray(list(elem))
        return elem
    # dtype=object keeps the ragged rows (required on recent NumPy)
    res = np.array([asarray(elem) for elem in x.values], dtype=object)
    res = np.unique(np.hstack(res))
    return set(res)
agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is not in the groupby now
    'amount': np.sum,
    'ID': combine_ids,
})
Edit2: Another solution which works in my case is:
combine_ids = lambda x: set(np.hstack(x.values))
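As a self-contained sketch of this (a toy frame mirroring the state before the second groupby, where the ID cells are a mix of arrays and scalars):

```python
import numpy as np
import pandas as pd

# np.hstack flattens the mix of scalar and array cells per group
combine_ids = lambda x: set(np.hstack(x.values))

before = pd.DataFrame({
    'X': ['a3', 'a3', 'a5', 'a5'],
    'Y': ['b2', 'b2', 'b3', 'b3'],
    'amount': [2, 7, 12, 17],
    'ID': [np.array([5, 7]), 3, np.array([6, 3]), np.array([3, 4])],
})

after = before.groupby(['X', 'Y']).agg({
    'amount': 'sum',
    'ID': combine_ids,
})
# after.loc[('a3', 'b2'), 'ID'] == {3, 5, 7}
```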
Edit3:
It seems that it is not possible to avoid set() as the resulting value, due to the implementation of Pandas' aggregation functions. Details in https://stackoverflow.com/a/16975602/3142459
If you're fine using sets as your type (which I probably would), then I would go with:
agg_df = df.groupby(['x', 'y', 'z']).agg({
    'amount': np.sum, 'id': lambda s: set(s)})
agg_df.reset_index().groupby(['x', 'y']).agg({
    'amount': np.sum, 'id': lambda s: set.union(*s)})
...which works for me. For some reason, lambda s: set(s) works but the bare builtin set doesn't (I'm guessing pandas isn't doing duck-typing correctly somewhere).
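A self-contained run of the two-step version above (toy data, using the lowercase column names from this answer):

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a1', 'a1', 'a3', 'a3', 'a3'],
    'y': ['b1', 'b1', 'b2', 'b2', 'b2'],
    'z': ['c1', 'c2', 'c3', 'c3', 'c4'],
    'amount': [10, 11, 1, 1, 7],
    'id': [2, 1, 5, 7, 3],
})

# First pass: collect ids per (x, y, z) as sets
agg_df = df.groupby(['x', 'y', 'z']).agg(
    {'amount': 'sum', 'id': lambda s: set(s)})

# Second pass: drop z and union the sets
result = agg_df.reset_index().groupby(['x', 'y']).agg(
    {'amount': 'sum', 'id': lambda s: set.union(*s)})
# result.loc[('a1', 'b1'), 'id'] == {1, 2}
# result.loc[('a3', 'b2'), 'id'] == {3, 5, 7}
```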
If your data is large, you'll probably want the following instead of lambda s: set.union(*s):
from functools import reduce
# can't partial b/c args are positional-only
def cheaper_set_union(s):
return reduce(set.union, s, set())
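Plugged into the aggregation it replaces the lambda (a self-contained toy sketch; the incremental reduce avoids unpacking every group's sets into one giant call):

```python
from functools import reduce

import pandas as pd

def cheaper_set_union(s):
    # Fold the sets one at a time instead of set.union(*s)
    return reduce(set.union, s, set())

# Toy stand-in for the intermediate frame with set-valued 'id' cells
agg_df = pd.DataFrame(
    {'amount': [2, 7], 'id': [{5, 7}, {3}]},
    index=pd.MultiIndex.from_tuples(
        [('a3', 'b2', 'c3'), ('a3', 'b2', 'c4')], names=['x', 'y', 'z']),
)

result = agg_df.reset_index().groupby(['x', 'y']).agg(
    {'amount': 'sum', 'id': cheaper_set_union})
# result.loc[('a3', 'b2'), 'id'] == {3, 5, 7}
```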