Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use groupby and aggregate with pandas dataframe on columns *and* index

I have a table that looks similar to this, with a multiindex on (A, B)

>>> columns = ["A", "B", "C", "D"]
... data = [
...     [1, 1, 99, 22],
...     [1, 2, 87, 24],
...     [1, 3, 65, 31],
...     [2, 1, 88, 30],
...     [2, 2, 76, 33],
...     [2, 3, 23, 32],
...     [2, 4, 38, 28],
...     [3, 1, 33, 40],
...     [3, 2, 23, 41],
...]
>>>
>>> pd_table = pd.DataFrame(data=data, columns=columns)
>>> pd_table.set_index(["A", "B"], inplace=True)
>>> print(pd_table)
      C   D
A B
1 1  99  22
  2  87  24
  3  65  31
2 1  88  30
  2  76  33
  3  23  32
  4  38  28
3 1  33  40
  2  23  41

If i want to groupby the results on an index, and apply an aggregate function on the groups, I can do this by

>>> roll_table = pd_table.groupby("A").aggregate({"C": min, "D": max})
>>> print(roll_table)
    C   D
A
1  65  31
2  23  33
3  23  41

However, this drops the B index, which I would like to keep. I also want to apply a function to this column, but apparently this fails:

>>> roll_table = pd_table.groupby("A").aggregate({"B": max, "C": min, "D": max})
>>> print(roll_index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\Python\Python38\site-packages\pandas\core\groupby\generic.py", line 928, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "...\Python\Python38\site-packages\pandas\core\base.py", line 357, in _aggregate
    raise SpecificationError("nested renamer is not supported")
pandas.core.base.SpecificationError: nested renamer is not supported

One way to solve this is to move the B index to the columns, perform the aggregation, then move it back to the index, but this seems cumbersome:

>>> roll_table = pd_table.reset_index(level="B")
>>> roll_table = roll_table.groupby("A").aggregate({"B": max, "C": min, "D": max})
>>> roll_table = roll_table.set_index("B", append=True)
>>> print(roll_table)
      C   D
A B
1 3  65  31
2 4  23  33
3 2  23  41

Is there a way to make aggregate work on ungrouped indexes as well?


To give a use case example, the indexes may be coordinates, and I want to use the first y value as a reference point. Or I may want to use "size" to keep track of how many values were grouped together.

>>> columns = ["x", "y", "Pressure"]
>>> data = [
...     [  1,   1,  99],
...     [  1,   2,  98],
...     [  1,   3,  101],
...     [  2,   2,  100],
...     [  2,   3,  96],
...     [  3,   1,  100],
...     [  3,   2,  102],
...     [  3,   3,  100],
... ]
>>>
>>> pd_table = pd.DataFrame(data=data, columns=columns)
>>> pd_table.set_index(["x", "y"], inplace=True)
>>>
>>> pd_table.reset_index(level="y", inplace=True)
>>> roll_index = pd_table.groupby("x").aggregate({"y": "first", "Pressure": "mean"})
>>> roll_index.set_index("y", append=True, inplace=True)
>>>
>>> print(roll_index)
       Pressure
x y
1 1   99.333333
2 2   98.000000
3 1  100.666667
like image 623
Frank Vel Avatar asked Nov 06 '22 06:11

Frank Vel


1 Answers

This is one option to get your result :

pd.DataFrame({key : {"B":value.index.get_level_values('B').max(), 
                     "C":value.C.min(),
                     "D":value.D.max()} 
              for key, value in pd_table.groupby("A").__iter__()}).T


     B  C   D
1   3   65  31
2   4   23  33
3   2   23  41
like image 160
sammywemmy Avatar answered Nov 14 '22 22:11

sammywemmy