
Pandas group by and sum, but create a new row when a certain amount is exceeded

I currently have a data set where I'm trying to group rows based on a column and sum the integer columns.

However, the catch is that I would like to start a new row once the sum has reached a certain threshold.

For example, in the dataframe below, I am trying to group the rows by company name and sum up the weights; however, I do not want any summed weight to exceed 100.

Input dataframe:

Company  Weight
a        30
b        45
a        27
a        40
b        57
a        57
b        32

Output dataframe:

Company  Weight
a        97
a        57
b        89
b        45

I have tried groupby and sum; however, it cannot detect whether I have reached the maximum amount.
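
For illustration, here is roughly what I tried (a minimal sketch that also reconstructs the dataframe above):

import pandas as pd

df = pd.DataFrame({
    "Company": ["a", "b", "a", "a", "b", "a", "b"],
    "Weight": [30, 45, 27, 40, 57, 57, 32],
})

# A plain groupby-sum collapses each company into a single total,
# with no way to cap a group at 100:
print(df.groupby("Company")["Weight"].sum())
# Company
# a    154
# b    134
# Name: Weight, dtype: int64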

Is there any way I can achieve this?

Any help would be greatly appreciated!

asked Jun 03 '21 by ChrisHo1341


2 Answers

I think loops are necessary here, so to improve performance use numba. This is a modified version of a solution by Divakar: call the function per group with GroupBy.transform, then aggregate with sum:

import numpy as np
import pandas as pd
from numba import njit

@njit
def make_groups(x, target):
    # Assign a group id to every row of one company; start a new group
    # whenever adding the current weight would push the running total
    # past target (the overflowing weight seeds the new group's total).
    result = np.empty(len(x), dtype=np.uint64)
    total = 0
    group = 0
    for i, x_i in enumerate(x):
        total += x_i
        if total > target:
            group += 1
            total = x_i
        result[i] = group
    return result

# group id per row, computed independently within each company
g = df.groupby("Company")["Weight"].transform(lambda x: make_groups(x.to_numpy(), 100))

df1 = (df.groupby(by=["Company", g])
         .sum()
         .reset_index(level=1, drop=True)
         .sort_values(["Company", "Weight"], ascending=[True, False])
         .reset_index())
print(df1)
  Company  Weight
0       a      97
1       a      57
2       b      89
3       b      45
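
The @njit decorator compiles make_groups to machine code, so the per-row loop runs at native speed. If numba is not available, dropping the decorator gives the same result with plain Python, just slower on large frames.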
answered Oct 23 '22 by jezrael


Well, it depends: unless you give up on an optimal packing (each group as close to 100 as possible without exceeding it), you are asking an NP-hard problem. There are a few algorithms you can use, but none of them are O(n), which is what groupby and sum are.

Say you iterate with iterrows() (try to avoid that): could you do it in one pass? If you are not looking for an optimal solution, there is an option:

For every company, sort the weights in increasing order, then iterate, opening a new row every time the running sum reaches 100, collecting the new rows in a side variable and replacing the originals at the end; see the sketch below.
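
A rough sketch of that greedy idea in plain Python (pack_rows is a hypothetical helper, not a pandas API; it assumes a cap of 100 and no single weight above it):

def pack_rows(weights, cap=100):
    # Greedily pack the weights, sorted ascending, into buckets
    # whose sums never exceed cap.
    buckets = []
    total = 0
    for w in sorted(weights):
        if total + w > cap:  # this weight would overflow: close the row
            buckets.append(total)
            total = 0
        total += w
    buckets.append(total)
    return buckets

out = []
for company, grp in df.groupby("Company"):
    for weight in pack_rows(grp["Weight"]):
        out.append({"Company": company, "Weight": weight})
df1 = pd.DataFrame(out)

Because it packs in sorted order, the rows it produces (97 and 57 for a, but 77 and 57 for b) need not match the question's expected output exactly; it only guarantees that no row exceeds the cap.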

There isn't a pandas / NumPy standard solution that I know of.

answered Oct 23 '22 by masasa