Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to groupby multiple columns and create a new column in Python based on thresholds

I have dataframe like below

Input

Invoice No  Date    Text            Vendor    Days
1000001     1/1/2020    Rent Payment    A   0
1000003     2/1/2020    Rent Payment    A   1
1000005     4/1/2020    Rent Payment    A   2
1000007     6/1/2020    Water payment   A   2
1000008     9/2/2020    Rep Payment     A   34
1000010     9/2/2020    Car Payment     A   0
1000011     10/2/2020   Car Payment     A   1
1000012     15/2/2020   Car Payment     A   5
1000013     16/2/2020   Car Payment     A   1
1000015     17/2/2020   Car Payment     A   1
1000002     1/1/2020    Rent Payment    B   -47
1000004     4/1/2020    Con Payment     B   3
1000006     6/1/2020    Con Payment     B   2
1000009     9/2/2020    Water payment   B   34
1000014    17/2/2020    Test Payment    B   8
1000016    19/2/2020    Test Payment    B   2

Condition

How write python condition which check the description,vendor name and days column and if the description,vendor name is the same and the no of days is <=2 then the rows should be grouped together under common group name say (G1) all the other rows can be assigned a unique group name.All grouped rows should have unique group name as shown in the output

Expected Output

Invoice No  Date        Text          Vendor   Days    Group
1000001     1/1/2020    Rent Payment    A       0        G1
1000003     2/1/2020    Rent Payment    A       1        G1
1000005     4/1/2020    Rent Payment    A       2        G1
1000007     6/1/2020    Water payment   A       2        G2
1000008     9/2/2020    Rep Payment     A       34       G3
1000010     9/2/2020    Car Payment     A       0        G4
1000011    10/2/2020    Car Payment     A       1        G4
1000012    15/2/2020    Car Payment     A       5        G5
1000013    16/2/2020    Car Payment     A       1        G5
1000015    17/2/2020    Car Payment     A       1        G5
1000002    1/1/2020     Rent Payment    B      -47       G6
1000004    4/1/2020     Con Payment     B       3        G7
1000006    6/1/2020     Con Payment     B       2        G7
1000009    9/2/2020     Water payment   B      34        G8
1000014    17/2/2020    Test Payment    B       8        G9
1000016    19/2/2020    Test Payment    B       2        G9
like image 808
Rahul rajan Avatar asked Dec 17 '22 13:12

Rahul rajan


1 Answers

You need to use groupby on three items: 'Text', 'Vendor', and a boolean representation of whether 'Days' changes by more than 2 within the groups defined by ['Text', 'Vendor'] alone.

After that, you need to name the unique groups. I provide two methods below.

ngroup

f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
g = df.groupby(['Text', 'Vendor', d], sort=False).ngroup()
df.assign(Group=g.add(1).astype(str).radd('G'))

    Invoice No       Date           Text Vendor  Days Group
0      1000001   1/1/2020   Rent Payment      A     0    G1
1      1000003   2/1/2020   Rent Payment      A     1    G1
2      1000005   4/1/2020   Rent Payment      A     2    G1
3      1000007   6/1/2020  Water payment      A     2    G2
4      1000008   9/2/2020    Rep Payment      A    34    G3
5      1000010   9/2/2020    Car Payment      A     0    G4
6      1000011  10/2/2020    Car Payment      A     1    G4
7      1000012  15/2/2020    Car Payment      A     5    G5
8      1000013  16/2/2020    Car Payment      A     1    G5
9      1000015  17/2/2020    Car Payment      A     1    G5
10     1000002   1/1/2020   Rent Payment      B   -47    G6
11     1000004   4/1/2020    Con Payment      B     3    G7
12     1000006   6/1/2020    Con Payment      B     2    G7
13     1000009   9/2/2020  Water payment      B    34    G8
14     1000014  17/2/2020   Test Payment      B     8    G9
15     1000016  19/2/2020   Test Payment      B     2    G9

factorize

f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
g = pd.factorize([*zip(df.Text, df.Vendor, d)])[0]
df.assign(Group=[f'G{i + 1}' for i in g])

    Invoice No       Date           Text Vendor  Days Group
0      1000001   1/1/2020   Rent Payment      A     0    G1
1      1000003   2/1/2020   Rent Payment      A     1    G1
2      1000005   4/1/2020   Rent Payment      A     2    G1
3      1000007   6/1/2020  Water payment      A     2    G2
4      1000008   9/2/2020    Rep Payment      A    34    G3
5      1000010   9/2/2020    Car Payment      A     0    G4
6      1000011  10/2/2020    Car Payment      A     1    G4
7      1000012  15/2/2020    Car Payment      A     5    G5
8      1000013  16/2/2020    Car Payment      A     1    G5
9      1000015  17/2/2020    Car Payment      A     1    G5
10     1000002   1/1/2020   Rent Payment      B   -47    G6
11     1000004   4/1/2020    Con Payment      B     3    G7
12     1000006   6/1/2020    Con Payment      B     2    G7
13     1000009   9/2/2020  Water payment      B    34    G8
14     1000014  17/2/2020   Test Payment      B     8    G9
15     1000016  19/2/2020   Test Payment      B     2    G9

Some Details

#        The first element of group    Cumulatively summing True/False
#        will get NaN so we fill it    will create a new value every time
#        in with 0         ║           we see a True.  This creates groups
#                          ║               ║     
#         adjacent differences   Should be obvious
#               ╭─┴──╮ ╭───╨───╮ ╭─┴─╮ ╭───╨──╮
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
like image 187
piRSquared Avatar answered May 26 '23 22:05

piRSquared