Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Sum of the Max 3 Column Values in Each Row

Tags:

python

pandas

Sample data:

        0        1         2        3       4  Sum_max_3
0  591949  2575703  22479693  2202865  499835   27258261
1    2705    11426    339913     5438    1016     356777
2      18      119      4162       18       0       4299
3     264     1213     14999      246     116      16476
4       0       35      1292       10       0       1337
5       0        0      1442        0       0       1442
6       0       28      5596       20       0       5644
7       0       10       102       56       0        168
8      33        0      1224       17       0       1274
9      39      198      9505       62      35       9765

I want to get the sum of the 3 columns with the largest values in the row. These are different columns for every row (Sum_max_3).

I have many many columns so I need to do this automatically for all of them.

like image 222
user1581390 Avatar asked Sep 16 '18 20:09

user1581390


People also ask

How do I sum across rows in pandas?

To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.

How do I get the sum of multiple columns in pandas?

Use DataFrame. sum() to get sum/total of a DataFrame for both rows and columns, to get the total sum of columns use axis=1 param. By default, this method takes axis=0 which means summing of rows.

How do I add 3 columns in pandas?

Add multiple columns to a data frame using Dataframe. assign() method. Using DataFrame. assign() method, we can set column names as parameters and pass values as list to replace/create the columns.


2 Answers

Use the underlying np array (using df.values), sort each row using np.sort, and sum the last 3 columns (which will be the 3 maximum values for each row):

>>> df
        0        1         2        3       4
0  591949  2575703  22479693  2202865  499835
1    2705    11426    339913     5438    1016
2      18      119      4162       18       0
3     264     1213     14999      246     116
4       0       35      1292       10       0
5       0        0      1442        0       0
6       0       28      5596       20       0
7       0       10       102       56       0
8      33        0      1224       17       0
9      39      198      9505       62      35


>>> df['Sum_max_3'] = np.sum(np.sort(df.values)[:,-3:],axis=1)

>>> df
        0        1         2        3       4  Sum_max_3
0  591949  2575703  22479693  2202865  499835   27258261
1    2705    11426    339913     5438    1016     356777
2      18      119      4162       18       0       4299
3     264     1213     14999      246     116      16476
4       0       35      1292       10       0       1337
5       0        0      1442        0       0       1442
6       0       28      5596       20       0       5644
7       0       10       102       56       0        168
8      33        0      1224       17       0       1274
9      39      198      9505       62      35       9765
like image 31
sacuL Avatar answered Sep 29 '22 09:09

sacuL


You should use numpy.partition to avoid a full sort. This will reduce the time complexity of finding the three maximum values from O(nlogn) to O(n), which will make a massive difference on large DataFrames:

np.sum(np.partition(df.values, -3)[:, -3:], 1)

df.assign(max3=np.sum(np.partition(df.values, -3)[:, -3:], 1))

        0        1         2        3       4      max3
0  591949  2575703  22479693  2202865  499835  27258261
1    2705    11426    339913     5438    1016    356777
2      18      119      4162       18       0      4299
3     264     1213     14999      246     116     16476
4       0       35      1292       10       0      1337
5       0        0      1442        0       0      1442
6       0       28      5596       20       0      5644
7       0       10       102       56       0       168
8      33        0      1224       17       0      1274
9      39      198      9505       62      35      9765

Timings

In [411]: df = pd.DataFrame(np.random.rand(5000, 5000))

In [412]: %timeit np.sum(np.sort(df.values)[:,-3:],axis=1)
1.69 s ± 92.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [413]: %timeit np.sum(np.partition(df.values, -3)[:, -3:], 1)
364 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 63
user3483203 Avatar answered Sep 29 '22 09:09

user3483203