
Better way to create columns in a pandas DataFrame with multi-level columns

Suppose I have a pandas DataFrame df with multi-level (MultiIndex) columns, like this:

  | A     |  B     -> first level
---------------------------------
  | x  y  |  x   y -> second level
---------------------------------
0|  5  5  |  1   5
1|  3  1  |  4   7
2|  1  4  |  10  20
3| 50  8  |  7   8

How can I create a new column with the difference between x and y under each first-level label?

I know that I could do it one by one, like this:

df[("A", "diff")] = df[("A", "x")] - df[("A", "y")]
df[("B", "diff")] = df[("B", "x")] - df[("B", "y")]

The final output would be:

  | A          |  B            -> first level
-----------------------------------------------
  | x  y  diff |  x   y   diff -> second level
-----------------------------------------------
0|  5  5  0    |  1   5   -4
1|  3  1  2    |  4   7   -3
2|  1  4  -3   |  10  20  -10
3| 50  8  42   |  7   8   -1

Is there a one-line operation that creates this column for every first-level label at once?

My solution seems very repetitive, and in my case I may have many labels (more than 10) at the first level.

Is there a better way of doing it?

Henrique Branco asked Nov 04 '25 08:11


2 Answers

sample df:

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3, 4, 5, 6, 1, 2, 3], [7, 8, 9, 10, 11, 12, 7, 8, 9], [13, 14, 15, 16, 17, 18, 4, 5, 6]], index=pd.date_range('2004-01-01', '2004-01-03'))
df.columns = pd.MultiIndex.from_product([['x', 'y', 'z'], list('abc')])

df:

             x           y           z
             a   b   c   a   b   c   a   b   c
2004-01-01   1   2   3   4   5   6   1   2   3
2004-01-02   7   8   9  10  11  12   7   8   9
2004-01-03  13  14  15  16  17  18   4   5   6

# sum the columns under each first-level label
df1 = df.T.groupby(level=0).sum().T
df1.columns = pd.MultiIndex.from_product([df1.columns, ["sum"]])
df1 = pd.concat([df, df1], axis=1).sort_index(axis=1)

df1:

             x                y                z
             a   b   c  sum   a   b   c  sum   a   b   c  sum
2004-01-01   1   2   3    6   4   5   6   15   1   2   3    6
2004-01-02   7   8   9   24  10  11  12   33   7   8   9   24
2004-01-03  13  14  15   42  16  17  18   51   4   5   6   15

Edit:

Subtraction:

# diff() within each first-level group, then keep only the 'c' columns (c - b)
df2 = df.T.groupby(level=[0]).diff().T.loc[:, df.columns.get_level_values(1).isin(['c'])]
df2 = pd.concat([df, df2.rename(columns={'c': 'diff b/w b and c'})], axis=1).sort_index(axis=1)

df2:

             x                             y                             z
             a   b   c  diff b/w b and c   a   b   c  diff b/w b and c   a   b   c  diff b/w b and c
2004-01-01   1   2   3               1.0   4   5   6               1.0   1   2   3               1.0
2004-01-02   7   8   9               1.0  10  11  12               1.0   7   8   9               1.0
2004-01-03  13  14  15               1.0  16  17  18               1.0   4   5   6               1.0

Edit(Final optimized):

df2 = (df.T.groupby(level=[0]).diff().T.rename(mapper=lambda x: f'diff_{x}', 
            axis='columns',
            level=1,
            ))
df2 = pd.concat([df, df2], axis=1).sort_index(axis=1)

df2:

             x                                   y                                   z
             a   b   c  diff_a  diff_b  diff_c   a   b   c  diff_a  diff_b  diff_c   a   b   c  diff_a  diff_b  diff_c
2004-01-01   1   2   3     NaN     1.0     1.0   4   5   6     NaN     1.0     1.0   1   2   3     NaN     1.0     1.0
2004-01-02   7   8   9     NaN     1.0     1.0  10  11  12     NaN     1.0     1.0   7   8   9     NaN     1.0     1.0
2004-01-03  13  14  15     NaN     1.0     1.0  16  17  18     NaN     1.0     1.0   4   5   6     NaN     1.0     1.0

As mentioned by Shubham Sharma :)

You can use:

for c in df.columns.levels[0]:
    df.loc[:, (c, 'diff')] = df[(c, 'b')] - df[(c, 'a')]

df = df.sort_index(level=0, axis=1)
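
Applied to the frame from the question, the same loop might look like the sketch below. It assumes every first-level label has both an x and a y column; the names labels and label are only illustrative. Because a plain sort_index(axis=1) would order the second level alphabetically (diff, x, y), the sketch reindexes with an explicit column order instead.

```python
import pandas as pd

# The question's frame: first level A/B, second level x/y.
df = pd.DataFrame(
    [[5, 5, 1, 5], [3, 1, 4, 7], [1, 4, 10, 20], [50, 8, 7, 8]],
    columns=pd.MultiIndex.from_product([["A", "B"], ["x", "y"]]),
)

# First-level labels that actually occur in the columns.
labels = df.columns.get_level_values(0).unique()

# Add one (label, "diff") column per first-level label.
for label in labels:
    df[(label, "diff")] = df[(label, "x")] - df[(label, "y")]

# Regroup the columns as x, y, diff under each label.
df = df.reindex(columns=pd.MultiIndex.from_product([labels, ["x", "y", "diff"]]))
print(df)
```

Assigning through a tuple key such as df[(label, "diff")] writes to df itself, which is why it is used here rather than chained indexing.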

Pygirl answered Nov 07 '25 00:11


You can try a little reshaping together with pd.DataFrame.eval. Note that sort_index sorts the column headers alphabetically, so the new column is named zdiff here to make it land after x and y.

df.stack(0).eval('zdiff = x - y').unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)

Output:

    A            B          
    x  y zdiff   x   y zdiff
0                           
0   5  5     0   1   5    -4
1   3  1     2   4   7    -3
2   1  4    -3  10  20   -10
3  50  8    42   7   8    -1
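
If the reshaping feels heavy, a different route is to take cross-sections with DataFrame.xs. This is an alternative technique, not the stack/eval approach above, and the names diff, labels and out are only illustrative. The sketch assumes every first-level label has both x and y:

```python
import pandas as pd

# The question's frame: first level A/B, second level x/y.
df = pd.DataFrame(
    [[5, 5, 1, 5], [3, 1, 4, 7], [1, 4, 10, 20], [50, 8, 7, 8]],
    columns=pd.MultiIndex.from_product([["A", "B"], ["x", "y"]]),
)

# xs() drops the selected level, leaving one column per first-level label,
# so the subtraction aligns A with A and B with B.
diff = df.xs("x", axis=1, level=1) - df.xs("y", axis=1, level=1)

# Put "diff" back as the second level so the columns match df's layout.
diff.columns = pd.MultiIndex.from_product([diff.columns, ["diff"]])

# Append the new columns, then order them as x, y, diff under each label.
labels = df.columns.get_level_values(0).unique()
out = pd.concat([df, diff], axis=1).reindex(
    columns=pd.MultiIndex.from_product([labels, ["x", "y", "diff"]])
)
print(out)
```

Reindexing with an explicit column order avoids the alphabetical sort, so the new column can keep the name diff.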

Scott Boston answered Nov 07 '25 01:11


