Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas sum across columns and divide each cell from that value

People also ask

How do I sum values between columns in pandas?

sum() to Sum All Columns. Use DataFrame. sum() to get sum/total of a DataFrame for both rows and columns, to get the total sum of columns use axis=1 param. By default, this method takes axis=0 which means summing of rows.

How do you divide a column from another column in a data frame?

The second method to divide two columns is using the div() method. It divides the columns elementwise. It accepts a scalar value, series, or dataframe as an argument for dividing with the axis. If the axis is 0 the division is done row-wise and if the axis is 1 then division is done column-wise.

How do I get the sum of multiple columns in pandas?

Sum all columns in a Pandas DataFrame into new column If we want to summarize all the columns, then we can simply use the DataFrame sum() method.


More simply:

result.div(result.sum(axis=1), axis=0)

(Edited to use code highlighting)


Try the following:

In [1]: import pandas as pd

In [2]: df = pd.read_csv("test.csv")

In [3]: df
Out[3]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [4]: df["sum"] = df.sum(axis=1)

In [5]: df
Out[5]: 
  id  value1  value2  value3  sum
0  A       1       2       3    6
1  B       4       5       6   15
2  C       7       8       9   24

In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [7]: df_new
Out[7]: 
     value1    value2  value3
0  0.166667  0.333333   0.500
1  0.266667  0.333333   0.400
2  0.291667  0.333333   0.375

Or you can do the following:

In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)

In [9]: df
Out[9]: 
  id    value1    value2  value3  sum
0  A  0.166667  0.333333   0.500    6
1  B  0.266667  0.333333   0.400   15
2  C  0.291667  0.333333   0.375   24

Or just straight up from the beginning:

In [10]: df = pd.read_csv("test.csv")

In [11]: df
Out[11]: 
  id  value1  value2  value3
0  A       1       2       3
1  B       4       5       6
2  C       7       8       9

In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)

In [13]: df
Out[13]: 
  id    value1    value2  value3
0  A  0.166667  0.333333   0.500
1  B  0.266667  0.333333   0.400
2  C  0.291667  0.333333   0.375

Changing the column value1 and the like to your headers should work similarly.


easier to work per column:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T

result:

         0         1      2
0  0.166667  0.333333  0.500
1  0.266667  0.333333  0.400
2  0.291667  0.333333  0.375

The following seemed to work fine for me:

In [39]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols]  = result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
result

Out[39]:
                0         1         2         3         4         5         6  \
user_id                                                                         
2        0.864827  0.059749  0.023540  0.018503  0.022280  0.004806  0.000797   
4        0.837285  0.018345  0.049453  0.025258  0.052732  0.002437  0.004077   
16       0.912269  0.046174  0.017810  0.011214  0.011214  0.000660  0.000000   
50       0.754286  0.137143  0.064762  0.009524  0.034286  0.000000  0.000000   
51       0.401868  0.120099  0.041265  0.085403  0.286491  0.032437  0.001232   

                7         8         9        10        11        12        13  \
user_id                                                                         
2        0.000000  0.000154  0.000077  0.001079  0.001439  0.002364  0.000385   
4        0.000000  0.005406  0.001019  0.003456  0.000266  0.000177  0.000089   
16       0.000000  0.000000  0.000000  0.000660  0.000000  0.000000  0.000000   
50       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
51       0.000513  0.012113  0.005235  0.003285  0.000924  0.006364  0.002772   

        group  
user_id        
2         l-1  
4         l-2  
16        l-2  
50        l-3  
51        l-4  

OK scratch the above, the following will be much faster:

result[cols]  = result[cols].div(result[cols].sum(axis=1), axis=0)

And just to prove the result is the same:

In [47]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(axis=1), axis=1))
Out[47]:
True

And that it's faster:

In [48]:

cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0) 
%timeit result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop