I have dataframe like below
Input
Invoice No Date Text Vendor Days
1000001 1/1/2020 Rent Payment A 0
1000003 2/1/2020 Rent Payment A 1
1000005 4/1/2020 Rent Payment A 2
1000007 6/1/2020 Water payment A 2
1000008 9/2/2020 Rep Payment A 34
1000010 9/2/2020 Car Payment A 0
1000011 10/2/2020 Car Payment A 1
1000012 15/2/2020 Car Payment A 5
1000013 16/2/2020 Car Payment A 1
1000015 17/2/2020 Car Payment A 1
1000002 1/1/2020 Rent Payment B -47
1000004 4/1/2020 Con Payment B 3
1000006 6/1/2020 Con Payment B 2
1000009 9/2/2020 Water payment B 34
1000014 17/2/2020 Test Payment B 8
1000016 19/2/2020 Test Payment B 2
Condition
How write python condition which check the description,vendor name and days column and if the description,vendor name is the same and the no of days is <=2 then the rows should be grouped together under common group name say (G1) all the other rows can be assigned a unique group name.All grouped rows should have unique group name as shown in the output
Expected Output
Invoice No Date Text Vendor Days Group
1000001 1/1/2020 Rent Payment A 0 G1
1000003 2/1/2020 Rent Payment A 1 G1
1000005 4/1/2020 Rent Payment A 2 G1
1000007 6/1/2020 Water payment A 2 G2
1000008 9/2/2020 Rep Payment A 34 G3
1000010 9/2/2020 Car Payment A 0 G4
1000011 10/2/2020 Car Payment A 1 G4
1000012 15/2/2020 Car Payment A 5 G5
1000013 16/2/2020 Car Payment A 1 G5
1000015 17/2/2020 Car Payment A 1 G5
1000002 1/1/2020 Rent Payment B -47 G6
1000004 4/1/2020 Con Payment B 3 G7
1000006 6/1/2020 Con Payment B 2 G7
1000009 9/2/2020 Water payment B 34 G8
1000014 17/2/2020 Test Payment B 8 G9
1000016 19/2/2020 Test Payment B 2 G9
You need to use groupby
on three items: 'Text'
, 'Vendor'
, and a boolean representation of whether 'Days'
changes by more than 2
within the groups defined by ['Text', 'Vendor']
alone.
After that, you need to name the unique groups. I provide two methods below.
ngroup
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
g = df.groupby(['Text', 'Vendor', d], sort=False).ngroup()
df.assign(Group=g.add(1).astype(str).radd('G'))
Invoice No Date Text Vendor Days Group
0 1000001 1/1/2020 Rent Payment A 0 G1
1 1000003 2/1/2020 Rent Payment A 1 G1
2 1000005 4/1/2020 Rent Payment A 2 G1
3 1000007 6/1/2020 Water payment A 2 G2
4 1000008 9/2/2020 Rep Payment A 34 G3
5 1000010 9/2/2020 Car Payment A 0 G4
6 1000011 10/2/2020 Car Payment A 1 G4
7 1000012 15/2/2020 Car Payment A 5 G5
8 1000013 16/2/2020 Car Payment A 1 G5
9 1000015 17/2/2020 Car Payment A 1 G5
10 1000002 1/1/2020 Rent Payment B -47 G6
11 1000004 4/1/2020 Con Payment B 3 G7
12 1000006 6/1/2020 Con Payment B 2 G7
13 1000009 9/2/2020 Water payment B 34 G8
14 1000014 17/2/2020 Test Payment B 8 G9
15 1000016 19/2/2020 Test Payment B 2 G9
factorize
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
g = pd.factorize([*zip(df.Text, df.Vendor, d)])[0]
df.assign(Group=[f'G{i + 1}' for i in g])
Invoice No Date Text Vendor Days Group
0 1000001 1/1/2020 Rent Payment A 0 G1
1 1000003 2/1/2020 Rent Payment A 1 G1
2 1000005 4/1/2020 Rent Payment A 2 G1
3 1000007 6/1/2020 Water payment A 2 G2
4 1000008 9/2/2020 Rep Payment A 34 G3
5 1000010 9/2/2020 Car Payment A 0 G4
6 1000011 10/2/2020 Car Payment A 1 G4
7 1000012 15/2/2020 Car Payment A 5 G5
8 1000013 16/2/2020 Car Payment A 1 G5
9 1000015 17/2/2020 Car Payment A 1 G5
10 1000002 1/1/2020 Rent Payment B -47 G6
11 1000004 4/1/2020 Con Payment B 3 G7
12 1000006 6/1/2020 Con Payment B 2 G7
13 1000009 9/2/2020 Water payment B 34 G8
14 1000014 17/2/2020 Test Payment B 8 G9
15 1000016 19/2/2020 Test Payment B 2 G9
# The first element of group Cumulatively summing True/False
# will get NaN so we fill it will create a new value every time
# in with 0 ║ we see a True. This creates groups
# ║ ║
# adjacent differences Should be obvious
# ╭─┴──╮ ╭───╨───╮ ╭─┴─╮ ╭───╨──╮
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With