Sample data:
        0        1         2        3       4  Sum_max_3
0  591949  2575703  22479693  2202865  499835   27258261
1    2705    11426    339913     5438    1016     356777
2      18      119      4162       18       0       4299
3     264     1213     14999      246     116      16476
4       0       35      1292       10       0       1337
5       0        0      1442        0       0       1442
6       0       28      5596       20       0       5644
7       0       10       102       56       0        168
8      33        0      1224       17       0       1274
9      39      198      9505       62      35       9765
I want to get the sum of the 3 columns with the largest values in the row. These are different columns for every row (Sum_max_3).
I have a great many columns, so I need to do this automatically across all of them.
To sum the values in each row of a DataFrame, call sum() with axis=1. This adds across the columns and returns one total per row.
DataFrame.sum() can total along either axis. By default it uses axis=0, which sums down the rows and returns one total per column; pass axis=1 to get one total per row instead.
Add one or more columns to a DataFrame with DataFrame.assign(). Each keyword argument is a column name, and its value (for example a list, array, or Series) becomes the new or replaced column. assign() returns a new DataFrame rather than modifying the original.
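As a small illustration of the two axis values and of assign(), here is a minimal sketch. The frame and the column names a, b, and row_total are made up for the example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frame, only for illustrating the axis argument
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df.sum()          # axis=0 (default): one total per column
row_totals = df.sum(axis=1)    # axis=1: one total per row

# assign() returns a new DataFrame with the extra column; df is unchanged
with_total = df.assign(row_total=row_totals)
print(with_total)
```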
Use the underlying np array (using df.values), sort each row using np.sort, and sum the last 3 columns (which will be the 3 maximum values for each row):
>>> df
        0        1         2        3       4
0  591949  2575703  22479693  2202865  499835
1    2705    11426    339913     5438    1016
2      18      119      4162       18       0
3     264     1213     14999      246     116
4       0       35      1292       10       0
5       0        0      1442        0       0
6       0       28      5596       20       0
7       0       10       102       56       0
8      33        0      1224       17       0
9      39      198      9505       62      35
>>> df['Sum_max_3'] = np.sum(np.sort(df.values)[:,-3:],axis=1)
>>> df
        0        1         2        3       4  Sum_max_3
0  591949  2575703  22479693  2202865  499835   27258261
1    2705    11426    339913     5438    1016     356777
2      18      119      4162       18       0       4299
3     264     1213     14999      246     116      16476
4       0       35      1292       10       0       1337
5       0        0      1442        0       0       1442
6       0       28      5596       20       0       5644
7       0       10       102       56       0        168
8      33        0      1224       17       0       1274
9      39      198      9505       62      35       9765
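For anyone who wants to run this outside the interactive session, the same sort-based approach could look roughly like the script below. It rebuilds only the first four rows of the sample data, and it uses df.to_numpy() in place of df.values, which is my substitution and assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# First four rows of the sample data from the question
df = pd.DataFrame([
    [591949, 2575703, 22479693, 2202865, 499835],
    [2705, 11426, 339913, 5438, 1016],
    [18, 119, 4162, 18, 0],
    [264, 1213, 14999, 246, 116],
])

values = df.to_numpy()
sorted_rows = np.sort(values, axis=1)              # ascending sort within each row
df["Sum_max_3"] = sorted_rows[:, -3:].sum(axis=1)  # last 3 entries are the 3 largest
print(df)
```

Note that values is taken before the new column is added, so Sum_max_3 never feeds back into its own calculation.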
You should use numpy.partition to avoid a full sort. For each row, this reduces the time complexity of finding the three maximum values from O(n log n) to O(n), where n is the number of columns, which makes a real difference on large DataFrames:
np.sum(np.partition(df.values, -3)[:, -3:], 1)
df.assign(max3=np.sum(np.partition(df.values, -3)[:, -3:], 1))
        0        1         2        3       4      max3
0  591949  2575703  22479693  2202865  499835  27258261
1    2705    11426    339913     5438    1016    356777
2      18      119      4162       18       0      4299
3     264     1213     14999      246     116     16476
4       0       35      1292       10       0      1337
5       0        0      1442        0       0      1442
6       0       28      5596       20       0      5644
7       0       10       102       56       0       168
8      33        0      1224       17       0      1274
9      39      198      9505       62      35      9765
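The partition one-liner generalizes to any k. The sketch below wraps it in a helper and checks it against the full-sort version on random data. The name sum_top_k, its k parameter, and the bounds check are my own additions for illustration and do not come from the answer.

```python
import numpy as np
import pandas as pd

def sum_top_k(frame: pd.DataFrame, k: int = 3) -> np.ndarray:
    """Sum of the k largest values in each row (hypothetical helper)."""
    values = frame.to_numpy()
    if not 1 <= k <= values.shape[1]:
        raise ValueError("k must be between 1 and the number of columns")
    # After partitioning at -k, the last k entries of each row are its
    # k largest values, in no particular order.
    top_k = np.partition(values, -k, axis=1)[:, -k:]
    return top_k.sum(axis=1)

# Sanity check against the full-sort approach on random integers
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(100, 20)))
expected = np.sort(df.to_numpy(), axis=1)[:, -3:].sum(axis=1)
assert np.array_equal(sum_top_k(df, 3), expected)
```

Because only the sum is needed, it does not matter that np.partition leaves the top k values unordered.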
Timings
In [411]: df = pd.DataFrame(np.random.rand(5000, 5000))
In [412]: %timeit np.sum(np.sort(df.values)[:,-3:],axis=1)
1.69 s ± 92.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [413]: %timeit np.sum(np.partition(df.values, -3)[:, -3:], 1)
364 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
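The %timeit magic only works in IPython. A rough equivalent with the standard timeit module might look like the sketch below. The 1000 x 1000 array is an assumption chosen to keep the run short, and the function names are made up. Absolute numbers will differ by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
values = rng.random((1000, 1000))   # smaller than the 5000 x 5000 frame above

def with_sort():
    # Full sort of every row, then sum the last three entries
    return np.sort(values, axis=1)[:, -3:].sum(axis=1)

def with_partition():
    # Partial ordering only: the last three entries are the three largest
    return np.partition(values, -3, axis=1)[:, -3:].sum(axis=1)

# Best of three repeats, five calls per repeat
t_sort = min(timeit.repeat(with_sort, number=5, repeat=3))
t_part = min(timeit.repeat(with_partition, number=5, repeat=3))
print(f"sort: {t_sort:.4f}s  partition: {t_part:.4f}s")
```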