I currently have a dataset where I'm trying to group rows based on a column and sum the columns whose values are integers.
However, the catch is that I would like to create a new row once the sum has reached a certain threshold.
For example, in the dataframe below, I am trying to group the rows by company name and sum up the weights; however, I do not want any summed weight to exceed 100.
Input dataframe:
| Company | Weight |
|---------|--------|
| a | 30 |
| b | 45 |
| a | 27 |
| a | 40 |
| b | 57 |
| a | 57 |
| b | 32 |
Output dataframe:
| Company | Weight |
|---------|--------|
| a | 97 |
| a | 57 |
| b | 89 |
| b | 45 |
I have tried using groupby and sum; however, they cannot detect whether or not I have reached the maximum amount.
Is there any way I can achieve this?
Any help would be greatly appreciated!
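For reference, here is the input as a DataFrame and what a plain `groupby().sum()` produces. It collapses each company into a single row and ignores the threshold entirely, which is why a custom grouping step is needed:

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["a", "b", "a", "a", "b", "a", "b"],
    "Weight":  [30, 45, 27, 40, 57, 57, 32],
})

# A plain groupby().sum() gives one row per company,
# with no awareness of the 100 cap:
print(df.groupby("Company")["Weight"].sum())
# Company
# a    154
# b    134
```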
I think loops are necessary here, so to improve performance use numba. This is a modified solution from Divakar: the function is called per group via GroupBy.transform, and the result is then aggregated with sum:
import numpy as np
from numba import njit

@njit
def make_groups(x, target):
    # Assign a bucket id to each value; start a new bucket
    # once the running total reaches the target.
    result = np.empty(len(x), dtype=np.uint64)
    total = 0
    group = 0
    for i, x_i in enumerate(x):
        total += x_i
        if total >= target:
            group += 1
            total = 0
        result[i] = group
    return result

g = df.groupby("Company")["Weight"].transform(lambda x: make_groups(x.to_numpy(), 100))
df1 = (df.groupby(by=["Company", g])
         .sum()
         .reset_index(1, drop=True)
         .sort_values(['Company', 'Weight'], ascending=[True, False])
         .reset_index())
print(df1)
Company Weight
0 a 97
1 a 57
2 b 89
3 b 45
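If numba isn't available, a minimal sketch of the same bucketing logic in plain Python (slower on large frames, same grouping rule) might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["a", "b", "a", "a", "b", "a", "b"],
    "Weight":  [30, 45, 27, 40, 57, 57, 32],
})

def make_groups(weights, target):
    # Assign a bucket id to each row; start a new bucket
    # once the running total reaches the target.
    result = []
    total = group = 0
    for w in weights:
        total += w
        if total >= target:
            group += 1
            total = 0
        result.append(group)
    return result

g = df.groupby("Company")["Weight"].transform(lambda x: make_groups(x, 100))
out = (df.groupby(["Company", g.rename("bucket")])["Weight"]
         .sum()
         .reset_index(level="bucket", drop=True)
         .reset_index())
print(out)
```

Sorting within each company, as in the answer above, is cosmetic; the per-bucket sums are the same either way.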
Well, it depends. If you want the optimal packing (each group's sum as close to 100 as possible), you're asking an NP-hard problem (bin packing). There are a few algorithms for it, but none are O(n), which is what groupby plus sum gives you. Say you iterate with iterrows() (try to avoid that): could you do it in one pass? If you are not looking for an optimal solution (closest to 100 in each group), there is an option:
for every company, sort the weights in increasing order, then iterate, opening a new row every time the running sum reaches 100, accumulating in a side variable and replacing the original rows at the end.
There isn't a pandas / Numpy standard solution that I know of.
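A sketch of that greedy idea, under the assumption that every bucket must stay at or under the cap (so the row sums can differ from the transform-based answer above, which starts a new bucket only after crossing the threshold):

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["a", "b", "a", "a", "b", "a", "b"],
    "Weight":  [30, 45, 27, 40, 57, 57, 32],
})

def fill_buckets(weights, cap=100):
    # Greedily pack ascending-sorted weights into buckets
    # whose sums never exceed cap.
    buckets = []
    total = cap + 1  # force a new bucket on the first weight
    for w in sorted(weights):
        if total + w > cap:
            buckets.append(w)  # open a new bucket
            total = w
        else:
            buckets[-1] += w   # add to the current bucket
            total += w
    return buckets

rows = [(company, s)
        for company, grp in df.groupby("Company")["Weight"]
        for s in fill_buckets(grp)]
out = pd.DataFrame(rows, columns=["Company", "Weight"])
print(out)
```

On the example data this yields buckets of 97 and 57 for company a, and 77 and 57 for company b; every bucket is at most 100, but the packing is greedy, not optimal.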