I have a dataframe structured like this:
First A B
Second bar baz foo bar baz foo
Third cat dog cat dog cat dog cat dog cat dog cat dog
0 3 8 7 7 4 7 5 3 2 2 6 2
1 8 6 5 7 8 7 1 8 6 0 3 9
2 9 2 2 9 7 3 1 8 4 1 0 8
3 3 6 0 6 3 2 2 6 2 4 6 9
4 7 6 4 3 1 5 0 4 8 4 8 1
So there are three column levels. I want to add a new column on the second level where for each of the third levels a computation is performed, for example 'new' = 'foo' + 'bar'. So the resulting dataframe would look like:
First A B
Second bar baz foo new bar baz foo new
Third cat dog cat dog cat dog cat dog cat dog cat dog cat dog cat dog
0 3 8 7 7 4 7 7 15 5 3 2 2 6 2 11 5
1 8 6 5 7 8 7 16 13 1 8 6 0 3 9 4 17
2 9 2 2 9 7 3 16 5 1 8 4 1 0 8 1 16
3 3 6 0 6 3 2 6 8 2 6 2 4 6 9 8 15
4 7 6 4 3 1 5 8 11 0 4 8 4 8 1 8 5
I have found a workaround which is listed at the end of this post, but its not at all 'panda-style' and prone to errors. The apply or transform function on a group seems like the right way to go but after hours of trying I still do not succeed. I figured the correct way should be something like:
def func(data):
fi = data.columns[0][0]
th = data.columns[0][2]
data[(fi,'new',th)] = data[(fi,'foo',th)] + data[(fi,'bar',th)]
print data
return data
print grouped.apply(func)
The new column is properly added within the function, but is not returned. Using the same function with transform would work if the 'new' column already exists in the df, but how do you add a new column at a specific level 'on the fly' or before grouping?
The code to generate the sample df is:
import pandas, itertools
first = ['A','B']
second = ['foo','bar','baz']
third = ['dog', 'cat']
tuples = []
for tup in itertools.product(first, second, third):
tuples.append(tup)
columns = pandas.MultiIndex.from_tuples(tuples, names=['First','Second','Third'])
data = np.random.randint(0,10,(5, 12))
df = pandas.DataFrame(data, columns=columns)
And my workaround:
dfnew = None
grouped = df.groupby(by=None, level=[0,2], axis=1)
for name, group in grouped:
newparam = group.xs('foo', axis=1, level=1) + group.xs('bar', axis=1, level=1)
dftmp = group.join(pandas.DataFrame(np.array(newparam), columns=pandas.MultiIndex.from_tuples([(group.columns[0][0], 'new', group.columns[0][2])], names=['First','Second', 'Third'])))
if dfnew is None:
dfnew = dftmp
else:
dfnew = pandas.concat([dfnew, dftmp], axis=1)
print dfnew.sort_index(axis=1)
Which works, but creating a new dataframe for each group and 'manually' assigning the levels is a really bad practice.
So what is the proper way to do this? I found several posts dealing with similar questions, but all of these had only 1 level of columns, and that's exactly what I'm struggling with.
You can also reset_index() on your groupby result to get back a dataframe with the name column now accessible. If you perform an operation on a single column the return will be a series with multiindex and you can simply apply pd. DataFrame to it and then reset_index. Show activity on this post.
You can use the following basic syntax to use GroupBy on a pandas DataFrame with a multiindex: #calculate sum by level 0 and 1 of multiindex df. groupby(level=[0,1]). sum() #calculate count by level 0 and 1 of multiindex df.
First look at GroupBy We did not tell GroupBy which column we wanted it to apply the aggregation function on, so it applied it to all the relevant columns and returned the output. But fortunately, GroupBy object supports column indexing just like a DataFrame!
There definitely is a weakness in the API here but I'm not sure off the top of my head to make it easier to do what you're doing. Here's one simple way around this, at least for your example:
In [20]: df
Out[20]:
First A B
Second foo bar baz foo bar baz
Third dog cat dog cat dog cat dog cat dog cat dog cat
0 7 2 9 3 3 0 5 9 8 2 0 6
1 1 4 1 7 2 3 2 3 1 0 4 0
2 6 5 0 6 6 1 5 1 7 4 3 6
3 4 8 1 9 0 3 9 2 3 1 5 9
4 6 1 1 5 1 2 2 6 3 7 2 1
In [21]: rdf = df.stack(['First', 'Third'])
In [22]: rdf['new'] = rdf.foo + rdf.bar
In [23]: rdf
Out[23]:
Second bar baz foo new
First Third
0 A cat 3 0 2 5
dog 9 3 7 16
B cat 2 6 9 11
dog 8 0 5 13
1 A cat 7 3 4 11
dog 1 2 1 2
B cat 0 0 3 3
dog 1 4 2 3
2 A cat 6 1 5 11
dog 0 6 6 6
B cat 4 6 1 5
dog 7 3 5 12
3 A cat 9 3 8 17
dog 1 0 4 5
B cat 1 9 2 3
dog 3 5 9 12
4 A cat 5 2 1 6
dog 1 1 6 7
B cat 7 1 6 13
dog 3 2 2 5
In [24]: rdf.unstack(['First', 'Third'])
Out[24]:
Second bar baz foo new
First A B A B A B A B
Third cat dog cat dog cat dog cat dog cat dog cat dog cat dog cat dog
0 3 9 2 8 0 3 6 0 2 7 9 5 5 16 11 13
1 7 1 0 1 3 2 0 4 4 1 3 2 11 2 3 3
2 6 0 4 7 1 6 6 3 5 6 1 5 11 6 5 12
3 9 1 1 3 3 0 9 5 8 4 2 9 17 5 3 12
4 5 1 7 3 2 1 1 2 1 6 6 2 6 7 13 5
And you can of course rearrange to your heart's content:
In [28]: rdf.unstack(['First', 'Third']).reorder_levels(['First', 'Second', 'Third'], axis=1).sortlevel(0, axis=1)
Out[28]:
First A B
Second bar baz foo new bar baz foo new
Third cat dog cat dog cat dog cat dog cat dog cat dog cat dog cat dog
0 3 9 0 3 2 7 5 16 2 8 6 0 9 5 11 13
1 7 1 3 2 4 1 11 2 0 1 0 4 3 2 3 3
2 6 0 1 6 5 6 11 6 4 7 6 3 1 5 5 12
3 9 1 3 0 8 4 17 5 1 3 9 5 2 9 3 12
4 5 1 2 1 1 6 6 7 7 3 1 2 6 2 13 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With