I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.
e.g.
import pandas as pd
import numpy as np
DF = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]], columns=['category', 'arraydata'])
This works the way I would expect it to:
DF.groupby('category').agg(sum)
output:
           arraydata
category
1         [50 70 90]
2         [20 30 40]
However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:
g=DF.groupby('category')
g.agg({'arraydata':sum})
Here is another:
g=DF.groupby('category')
g['arraydata'].agg(sum)
Both give the same output:
Exception: must produce aggregated value
However, if I have a column that contains numeric rather than array data, it works fine. I can work around this, but it's confusing, and I'm wondering whether this is a bug or whether I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case, and indeed I wasn't sure whether they were supported. Ideas?
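For concreteness, here's roughly what I mean by a numeric column working fine (a minimal sketch; 'numdata' is just a made-up column standing in for my real numeric data):

DF2 = DF.copy()
DF2['numdata'] = [5, 6, 7]        # hypothetical plain numeric column
g2 = DF2.groupby('category')
g2.agg({'numdata': sum})          # aggregates without complaint
g2['numdata'].agg(sum)            # likewise fine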
Thanks
One, perhaps clunkier, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:
grouped = DF.groupby("category")
# build (category, summed array) pairs, then rebuild a DataFrame indexed by category
aggregate = [(k, v["arraydata"].sum()) for k, v in grouped]
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")
This is very similar to what pandas is doing under the hood anyways [groupby, then do some aggregation, then merge back in], so you aren't really losing out on much.
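If you also need to aggregate ordinary numeric columns in the same pass, the same loop extends naturally. A sketch (the extra 'numdata' column here is hypothetical, just to illustrate mixing scalar and array aggregation):

DF_multi = DF.copy()
DF_multi["numdata"] = [5, 6, 7]       # hypothetical plain numeric column
rows = [(k, v["arraydata"].sum(), v["numdata"].sum())
        for k, v in DF_multi.groupby("category")]
combined = pd.DataFrame(rows, columns=["category", "arraydata", "numdata"]).set_index("category")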
The problem here is that pandas explicitly checks that the output is not an ndarray, because it wants to be able to reshape your results intelligently. You can see this in the following snippet from _aggregate_named, where the error occurs.
def _aggregate_named(self, func, *args, **kwargs):
    result = {}

    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)

    return result
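You can see that check in action: the very same column aggregates happily as soon as the function returns a scalar per group (a quick sketch; the exact exception text may vary across pandas versions):

g = DF.groupby('category')
g['arraydata'].agg(lambda x: len(x))    # one scalar per group, so the check passes
# g['arraydata'].agg(sum)               # one ndarray per group -> 'Must produce aggregated value'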
My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:
DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
Pandas doesn't complain, because now you have an array of Python objects [but this is really just cheating around the typecheck]. And if you want to convert back to array, just apply np.array to it.
result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)
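A quick sanity check that the round trip gives you real ndarrays back (assuming the two lines above have run):

print(type(result.loc[1, 'arraydata']))   # <class 'numpy.ndarray'>
print(result.loc[1, 'arraydata'])         # [50 70 90]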
How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over the GroupBy object like I've shown above.