Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Share/percent across a list of columns in a pandas agg

Tags:

python

pandas

I have this kind of data frame

dat = [{"date": datetime.date(2021,1,1), "c_id" : "a", "var1": 2, "var2":  1, "var3" : 10 },
       {"date": datetime.date(2021,1,1), "c_id" : "b", "var1": 2, "var2":  0, "var3" : 20 },
       {"date": datetime.date(2021,2,1), "c_id" : "a", "var1": 2, "var2":  1, "var3" : 30 },
       {"date": datetime.date(2021,2,1), "c_id" : "b", "var1": 2, "var2":  3, "var3" : 10 },
       {"date": datetime.date(2021,3,1), "c_id" : "a", "var1": 2, "var2":  1, "var3" : 30 },
       {"date": datetime.date(2021,3,1), "c_id" : "b", "var1": 2, "var2":  3, "var3" : 20 },
       ]

df = pd.DataFrame(dat)

>>> df
         date c_id  var1  var2  var3
0  2021-01-01    a     2     1    10
1  2021-01-01    b     2     0    20
2  2021-02-01    a     2     1    30
3  2021-02-01    b     2     3    10
4  2021-03-01    a     2     1    30
5  2021-03-01    b     2     3    20

I'd like to have the share of these 3 named variables per (date, c_id). So for example...

>>> df
         date c_id  var1  var2  var3  var1_share var2_share var3_share
0  2021-01-01    a     2     1    10        0.15       0.07       0.76
1  2021-01-01    b     2     0    20        0.09       0.00       0.90
2  2021-02-01    a     2     1    30        0.06       0.03       0.90
3  2021-02-01    b     2     3    10        0.13       0.20       0.66
4  2021-03-01    a     2     1    30        0.06       0.03       0.90
5  2021-03-01    b     2     3    20        0.08       0.12       0.80

While I can do this in kind of a dumb way if I list these out individually...

>>> df.insert(5, "var1_share", df.apply(lambda x: x["var1"] / x[["var1", "var2", "var3"]].sum(), axis=1))
>>> df
         date c_id  var1  var2  var3  var1_share
0  2021-01-01    a     2     1    10    0.153846
1  2021-01-01    b     2     0    20    0.090909
2  2021-02-01    a     2     1    30    0.060606
3  2021-02-01    b     2     3    10    0.133333
4  2021-03-01    a     2     1    30    0.060606
5  2021-03-01    b     2     3    20    0.080000

What's the pandas magic for iterating this procedure over some list of valid columns, mylist= ["var1", "var2", "var3"]? I suspect there is an apply that can do this in a one-liner?

Also, pandas experts, what would this operation be called across columns of a dataframe? I'm sure this is common, but I'm not sure how I could have searched for it better.

like image 407
Mittenchops Avatar asked May 19 '21 19:05

Mittenchops


1 Answers

Try this:

cols = pd.Index(['var1', 'var2', 'var3'])
df[cols+'_share'] = df[cols].div(df.sum(axis=1), axis=0)

Output:

         date c_id  var1  var2  var3  var1_share  var2_share  var3_share
0  2021-01-01    a     2     1    10    0.153846    0.076923    0.769231
1  2021-01-01    b     2     0    20    0.090909    0.000000    0.909091
2  2021-02-01    a     2     1    30    0.060606    0.030303    0.909091
3  2021-02-01    b     2     3    10    0.133333    0.200000    0.666667
4  2021-03-01    a     2     1    30    0.060606    0.030303    0.909091
5  2021-03-01    b     2     3    20    0.080000    0.120000    0.800000

Let's use pandas intrinsic data alignment and pd.DataFrame.div with parameter axis=0 and pd.DataFrame.sum with axis=1.

like image 169
Scott Boston Avatar answered Oct 10 '22 06:10

Scott Boston