Add a column with a groupby on a hierarchical dataframe

Tags:

I have a dataframe structured like this:

First     A                             B                         
Second  bar       baz       foo       bar       baz       foo     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    8    7    7    4    7    5    3    2    2    6    2
1         8    6    5    7    8    7    1    8    6    0    3    9
2         9    2    2    9    7    3    1    8    4    1    0    8
3         3    6    0    6    3    2    2    6    2    4    6    9
4         7    6    4    3    1    5    0    4    8    4    8    1

So there are three column levels. I want to add a new column on the second level where for each of the third levels a computation is performed, for example 'new' = 'foo' + 'bar'. So the resulting dataframe would look like:

First     A                                       B                                   
Second  bar       baz       foo       new       bar       baz       foo       new     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    8    7    7    4    7    7   15    5    3    2    2    6    2   11    5
1         8    6    5    7    8    7   16   13    1    8    6    0    3    9    4   17
2         9    2    2    9    7    3   16    5    1    8    4    1    0    8    1   16
3         3    6    0    6    3    2    6    8    2    6    2    4    6    9    8   15
4         7    6    4    3    1    5    8   11    0    4    8    4    8    1    8    5

I have found a workaround which is listed at the end of this post, but its not at all 'panda-style' and prone to errors. The apply or transform function on a group seems like the right way to go but after hours of trying I still do not succeed. I figured the correct way should be something like:

def func(data):

    fi = data.columns[0][0]
    th = data.columns[0][2]

    data[(fi,'new',th)] = data[(fi,'foo',th)] + data[(fi,'bar',th)]

    print data
    return data

print grouped.apply(func)

The new column is properly added within the function, but is not returned. Using the same function with transform would work if the 'new' column already exists in the df, but how do you add a new column at a specific level 'on the fly' or before grouping?

The code to generate the sample df is:

import pandas, itertools

first = ['A','B']
second = ['foo','bar','baz']
third = ['dog', 'cat']

tuples = []
for tup in itertools.product(first, second, third):
    tuples.append(tup)

columns = pandas.MultiIndex.from_tuples(tuples, names=['First','Second','Third'])

data = np.random.randint(0,10,(5, 12))
df = pandas.DataFrame(data, columns=columns)

And my workaround:

dfnew = None
grouped = df.groupby(by=None, level=[0,2], axis=1)

for name, group in grouped:
    newparam = group.xs('foo', axis=1, level=1) + group.xs('bar', axis=1, level=1)

    dftmp = group.join(pandas.DataFrame(np.array(newparam), columns=pandas.MultiIndex.from_tuples([(group.columns[0][0], 'new', group.columns[0][2])], names=['First','Second', 'Third'])))

    if dfnew is None:
        dfnew = dftmp
    else:
        dfnew = pandas.concat([dfnew, dftmp], axis=1)

print dfnew.sort_index(axis=1)

Which works, but creating a new dataframe for each group and 'manually' assigning the levels is a really bad practice.

So what is the proper way to do this? I found several posts dealing with similar questions, but all of these had only 1 level of columns, and that's exactly what I'm struggling with.

657

asked Oct 18 '12 08:10

Rutger Kassies

1 Answers

There definitely is a weakness in the API here but I'm not sure off the top of my head to make it easier to do what you're doing. Here's one simple way around this, at least for your example:

In [20]: df
Out[20]: 
First     A                             B                         
Second  foo       bar       baz       foo       bar       baz     
Third   dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat
0         7    2    9    3    3    0    5    9    8    2    0    6
1         1    4    1    7    2    3    2    3    1    0    4    0
2         6    5    0    6    6    1    5    1    7    4    3    6
3         4    8    1    9    0    3    9    2    3    1    5    9
4         6    1    1    5    1    2    2    6    3    7    2    1

In [21]: rdf = df.stack(['First', 'Third'])

In [22]: rdf['new'] = rdf.foo + rdf.bar

In [23]: rdf
Out[23]: 
Second         bar  baz  foo  new
  First Third                    
0 A     cat      3    0    2    5
        dog      9    3    7   16
  B     cat      2    6    9   11
        dog      8    0    5   13
1 A     cat      7    3    4   11
        dog      1    2    1    2
  B     cat      0    0    3    3
        dog      1    4    2    3
2 A     cat      6    1    5   11
        dog      0    6    6    6
  B     cat      4    6    1    5
        dog      7    3    5   12
3 A     cat      9    3    8   17
        dog      1    0    4    5
  B     cat      1    9    2    3
        dog      3    5    9   12
4 A     cat      5    2    1    6
        dog      1    1    6    7
  B     cat      7    1    6   13
        dog      3    2    2    5

In [24]: rdf.unstack(['First', 'Third'])
Out[24]: 
Second  bar                 baz                 foo                 new               
First     A         B         A         B         A         B         A         B     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    9    2    8    0    3    6    0    2    7    9    5    5   16   11   13
1         7    1    0    1    3    2    0    4    4    1    3    2   11    2    3    3
2         6    0    4    7    1    6    6    3    5    6    1    5   11    6    5   12
3         9    1    1    3    3    0    9    5    8    4    2    9   17    5    3   12
4         5    1    7    3    2    1    1    2    1    6    6    2    6    7   13    5

And you can of course rearrange to your heart's content:

In [28]: rdf.unstack(['First', 'Third']).reorder_levels(['First', 'Second', 'Third'], axis=1).sortlevel(0, axis=1)
Out[28]: 
First     A                                       B                                   
Second  bar       baz       foo       new       bar       baz       foo       new     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    9    0    3    2    7    5   16    2    8    6    0    9    5   11   13
1         7    1    3    2    4    1   11    2    0    1    0    4    3    2    3    3
2         6    0    1    6    5    6   11    6    4    7    6    3    1    5    5   12
3         9    1    3    0    8    4   17    5    1    3    9    5    2    9    3   12
4         5    1    2    1    1    6    6    7    7    3    1    2    6    2   13    5

133

answered Oct 19 '22 06:10

Wes McKinney

Related questions
                            
                                How to tell what python version libboost_python.so is using?
                            
                                Python 2.7.2 and Google App Engine SDK 1.6.1 on Win 7 Home Premium not working
                            
                                How can I configure Auto Scaling with boto using scaling policies and metrics?
                            
                                py2exe `ImportError: No module named backend_tkagg`
                            
                                pros and cons of using factory vs regular constructor
                            
                                Virtual environments and embedding Python
                            
                                Can Python's argparse permute argument order like gnu getopt?
                            
                                Realtime options (Websockets, flash, polling) for Django?
                            
                                Built-in non-data version of property?
                            
                                What's the best design for a RESTful URI with multiple mandatory parameters?
                            
                                Blueprint-specific assets in Flask using Flask-Assets?
                            
                                How to install english.pickle for nltk on an off line Linux machine
                            
                                Fabric env.hosts and run in the same method => No hosts found
                            
                                Do python lists have an equivalent for __contains__ that tests for identity?
                            
                                Extended slice in python?
                            
                                Are there generic Python libraries that provide 'signals' (event) capability like Django signals?
                            
                                How to turn off logging statements without removing them from the code
                            
                                Python traceback module for beginners
                            
                                What is the difference between Abstract classes and Mixins in Python django
                            
                                Any difference between dir() and locals() in Python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Add a column with a groupby on a hierarchical dataframe

Tags:

python

pandas

group-by

Rutger Kassies

People also ask

1 Answers

Wes McKinney

Recent Activity

Donate For Us