Suppose I have a pandas DataFrame df with multi-level columns, like this:
  |    A     |    B       -> first level
---------------------------------
  |  x    y  |  x    y    -> second level
---------------------------------
0 |  5    5  |  1    5
1 |  3    1  |  4    7
2 |  1    4  | 10   20
3 | 50    8  |  7    8
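How can I create a new column with the difference between x and y under each first-level label?
I know that I could do it one label at a time, like this:
# tuple keys assign into df itself; df["A"]["diff"] = ... would only modify a temporary copy
df[("A", "diff")] = df[("A", "x")] - df[("A", "y")]
df[("B", "diff")] = df[("B", "x")] - df[("B", "y")]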
The final output would be:
  |        A        |        B          -> first level
-----------------------------------------------
  |  x    y   diff  |  x    y   diff    -> second level
-----------------------------------------------
0 |  5    5     0   |  1    5    -4
1 |  3    1     2   |  4    7    -3
2 |  1    4    -3   | 10   20   -10
3 | 50    8    42   |  7    8    -1
Is there a one-line operation that creates this column under all first-level labels at once?
My solution seems very repetitive, and in my case there may be more than 10 labels at the first level.
Is there a better way of doing it?
sample df:
import pandas as pd

df = pd.DataFrame(data=[[1,2,3,4,5,6,1,2,3], [7,8,9,10,11,12,7,8,9], [13,14,15,16,17,18,4,5,6]], index=pd.date_range('2004-01-01', '2004-01-03'))
df.columns = pd.MultiIndex.from_product([['x', 'y', 'z'], list('abc')])
df:
| | x | x | x | y | y | y | z | z | z |
|---|---|---|---|---|---|---|---|---|---|
| | a | b | c | a | b | c | a | b | c |
| 2004-01-01 | 1 | 2 | 3 | 4 | 5 | 6 | 1 | 2 | 3 |
| 2004-01-02 | 7 | 8 | 9 | 10 | 11 | 12 | 7 | 8 | 9 |
| 2004-01-03 | 13 | 14 | 15 | 16 | 17 | 18 | 4 | 5 | 6 |
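# DataFrame.sum(level=...) was removed in pandas 2.0; grouping the transposed frame by the first column level gives the same per-label sums
df1 = df.T.groupby(level=0).sum().T
df1.columns = pd.MultiIndex.from_product([df1.columns, ["sum"]])
df1 = pd.concat([df, df1], axis=1).sort_index(axis=1)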
df1:
| | x | x | x | x | y | y | y | y | z | z | z | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | a | b | c | sum | a | b | c | sum | a | b | c | sum |
| 2004-01-01 | 1 | 2 | 3 | 6 | 4 | 5 | 6 | 15 | 1 | 2 | 3 | 6 |
| 2004-01-02 | 7 | 8 | 9 | 24 | 10 | 11 | 12 | 33 | 7 | 8 | 9 | 24 |
| 2004-01-03 | 13 | 14 | 15 | 42 | 16 | 17 | 18 | 51 | 4 | 5 | 6 | 15 |
Subtraction:
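# diff() on the transposed frame subtracts, within each first-level group, the previous column from each column; keep only the 'c' columns (c - b)
df2 = df.T.groupby(level=[0]).diff().T.loc[:, df.columns.get_level_values(1).isin(['c'])]
df2 = pd.concat([df, df2.rename(columns={'c': 'diff b/w b and c'})], axis=1).sort_index(axis=1)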
df2:
| | x | x | x | x | y | y | y | y | z | z | z | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | a | b | c | diff b/w b and c | a | b | c | diff b/w b and c | a | b | c | diff b/w b and c |
| 2004-01-01 | 1 | 2 | 3 | 1.0 | 4 | 5 | 6 | 1.0 | 1 | 2 | 3 | 1.0 |
| 2004-01-02 | 7 | 8 | 9 | 1.0 | 10 | 11 | 12 | 1.0 | 7 | 8 | 9 | 1.0 |
| 2004-01-03 | 13 | 14 | 15 | 1.0 | 16 | 17 | 18 | 1.0 | 4 | 5 | 6 | 1.0 |
# Same diff, but keep every column and prefix the second-level names with 'diff_'
df2 = (
    df.T.groupby(level=[0]).diff().T
    .rename(mapper=lambda x: f'diff_{x}', axis='columns', level=1)
)
df2 = pd.concat([df, df2], axis=1).sort_index(axis=1)
df2:
| | x | x | x | x | x | x | y | y | y | y | y | y | z | z | z | z | z | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | a | b | c | diff_a | diff_b | diff_c | a | b | c | diff_a | diff_b | diff_c | a | b | c | diff_a | diff_b | diff_c |
| 2004-01-01 | 1 | 2 | 3 | NaN | 1.0 | 1.0 | 4 | 5 | 6 | NaN | 1.0 | 1.0 | 1 | 2 | 3 | NaN | 1.0 | 1.0 |
| 2004-01-02 | 7 | 8 | 9 | NaN | 1.0 | 1.0 | 10 | 11 | 12 | NaN | 1.0 | 1.0 | 7 | 8 | 9 | NaN | 1.0 | 1.0 |
| 2004-01-03 | 13 | 14 | 15 | NaN | 1.0 | 1.0 | 16 | 17 | 18 | NaN | 1.0 | 1.0 | 4 | 5 | 6 | NaN | 1.0 | 1.0 |
As mentioned by Shubham Sharma :)
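The groupby/diff approach above always subtracts neighbouring columns. If the two columns to subtract are not adjacent, one alternative (my own substitution, not part of the answer above) is to pull each second-level column out with `DataFrame.xs` and subtract the cross-sections. The sketch below assumes the same sample frame; the column name `diff_a_c` is made up for illustration.

```python
import pandas as pd

# Same sample frame as above.
df = pd.DataFrame(
    data=[[1, 2, 3, 4, 5, 6, 1, 2, 3],
          [7, 8, 9, 10, 11, 12, 7, 8, 9],
          [13, 14, 15, 16, 17, 18, 4, 5, 6]],
    index=pd.date_range("2004-01-01", "2004-01-03"),
)
df.columns = pd.MultiIndex.from_product([["x", "y", "z"], list("abc")])

# Each cross-section has one column per first-level label (x, y, z),
# so the subtraction lines up label by label.
diff = df.xs("a", axis=1, level=1) - df.xs("c", axis=1, level=1)

# Add a second level (hypothetical name) so the result fits next to df,
# then concatenate and sort so each label's columns sit together.
diff.columns = pd.MultiIndex.from_product([diff.columns, ["diff_a_c"]])
out = pd.concat([df, diff], axis=1).sort_index(axis=1)
print(out)
```

Because `xs` selects by name, this does not depend on the order of the second-level columns.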
You can use:
for c in df.columns.levels[0]:
    df.loc[:, (c, 'diff')] = df[(c, 'b')] - df[(c, 'a')]
df = df.sort_index(level=0, axis=1)
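Applied to the frame from the question (x minus y under A and B), the loop could look roughly like this. It is a sketch: building the frame with `pd.MultiIndex.from_product` and the explicit `reindex` at the end are my assumptions, the latter added only to get the x, y, diff order shown in the question, since sorting the columns would order them alphabetically.

```python
import pandas as pd

# Frame from the question: first level A/B, second level x/y.
df = pd.DataFrame(
    [[5, 5, 1, 5], [3, 1, 4, 7], [1, 4, 10, 20], [50, 8, 7, 8]],
    columns=pd.MultiIndex.from_product([["A", "B"], ["x", "y"]]),
)

# Capture the first-level labels before adding columns.
labels = list(df.columns.levels[0])

for label in labels:
    # A tuple key addresses one (first level, second level) column.
    df[(label, "diff")] = df[(label, "x")] - df[(label, "y")]

# New columns are appended at the end; put them next to their label
# in an explicit order instead of relying on alphabetical sorting.
df = df.reindex(columns=pd.MultiIndex.from_product([labels, ["x", "y", "diff"]]))
print(df)
```

The loop stays the same length no matter how many first-level labels there are, which addresses the repetition in the question.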
You can try a little reshaping together with pd.DataFrame.eval. Note that sort_index sorts the column headers alphabetically, so the new column is named zdiff here to make it land after x and y.
df.stack(0).eval('zdiff = x - y').unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)
Output:
A B
x y zdiff x y zdiff
0
0 5 5 0 1 5 -4
1 3 1 2 4 7 -3
2 1 4 -3 10 20 -10
3 50 8 42 7 8 -1
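For anyone who wants to run this end to end, here is the same chain split into steps on the question's frame. This is a sketch: the frame construction and the final rename from zdiff to diff are my additions, not part of the answer. Recent pandas versions may emit a FutureWarning on `stack`.

```python
import pandas as pd

# Frame from the question: first level A/B, second level x/y.
df = pd.DataFrame(
    [[5, 5, 1, 5], [3, 1, 4, 7], [1, 4, 10, 20], [50, 8, 7, 8]],
    columns=pd.MultiIndex.from_product([["A", "B"], ["x", "y"]]),
)

# Move the first column level into the row index: columns are now x and y.
long = df.stack(0)

# eval returns a new frame with the extra column.
long = long.eval("zdiff = x - y")

# Move A/B back to the columns, put it on top, and sort alphabetically.
wide = long.unstack()
result = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)

# Optional: once sorted, the placeholder name can be changed without
# affecting the column order.
result = result.rename(columns={"zdiff": "diff"}, level=1)
print(result)
```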