I have a df
that contains numerous Places
at recurring time periods. These Places
are beginning and finishing in random fashion. For each time period, I want to assign each unique place to a Group
. The central rules in doing this are:
1) Each Group
can hold no more than 3 unique Places
at any one time
2) Unique Places
should be evenly distributed across each Group
I've taken a very small subsection of the df
. There are 7 unique values (but no more than 5 occuring at any one time) and 2 Groups
to choose from. But in practice, the df
could contain up to 50 unique values in total that finish and end and varying time periods that will be distributed across a maximum of 6 Groups.
To understand how many Places
are currently occuring I've included a Total
, which is based on if the Place
appears again.
The df
contains all available Groups
for each unique Place
at each Period
. Places Golf
and Club
will finish but we assume all other places are continued as they appear later in the df
.
df = pd.DataFrame({
'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6],
'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS'],
'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5],
'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1'],
})
The main issue that's causing me trouble is Places
appear/exist dynamically. In that, they finish and new ones begin in a random fashion. So assigning and distributing the current unique Places
needs to account for this concept
Attempt:
def AssignPlace(df):
uniquePlaces = df['Place'].unique()
G3 = dict(zip(uniquePlaces, np.arange(len(uniquePlaces)) // 3 + 1))
df['Assigned Group'] = df['Place'].map(G3)
return df
df = df.groupby('Available Group', sort=False).apply(AssignPlace)
df = df.drop_duplicates(subset = ['Period','Place'])
Out:
Period Place Total Available Group Assigned Group
0 1 CLUB 1 1 1
1 2 CLUB 1 2 1
3 2 HOME 2 2 1
5 2 AWAY 3 2 1
7 3 WORK 4 2 2
9 3 AWAY 4 1 1
11 3 GOLF 5 1 2 #GOLF FINISHES SO 4 OCCURING FROM NEXT ROW
13 4 CLUB 4 2 1 #CLUB FINISHES BUT POOL STARTS SO STILL 4 OCCURING FROM NEXT ROW
15 4 POOL 4 2 2
17 4 HOME 4 2 1
19 5 WORK 4 2 2
21 5 AWAY 4 1 1
23 5 POOL 4 1 2
25 6 TENNIS 5 2 3 #Signifies issue
The last row displays the start of the issue. The assigned group measures this place correctly as the 7th unique value but it doesn't account for current unique values. As Club
and Golf
finish, they are only 5 current unqiue values and 2 available groups. But it's returning Group 3
. So each new unique value will continue to be counted, rather than account for currently occuring unique values.
Intended Output, TENNIS
Assigned Group is now 1
, instead of 3
:
Period Place Total Available Group Assigned Group
0 1 CLUB 1 1 1
1 2 CLUB 1 2 1
3 2 HOME 2 2 1
5 2 AWAY 3 2 1
7 3 WORK 4 2 2
9 3 AWAY 4 1 1
11 3 GOLF 5 1 2
13 4 CLUB 4 2 1
15 4 POOL 4 2 2
17 4 HOME 4 2 1
19 5 WORK 4 2 2
21 5 AWAY 4 1 1
23 5 POOL 4 1 2
25 6 TENNIS 5 2 1
Here is my try. Explanation are on code comment, if not enough leave me a comment here
NOTE: I have added 5 dummy rows at bottom to simulate that those places will appear latter in df. So please ignore rows with period=0
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,0,0,0,0,0],
'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS', "AWAY","HOME","POOL","WORK", "TENNIS"],
# 'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5,0,0,0,0,0],
# 'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1',0,0,0,0,0],
})
# df to store all unique places
uniquePlaces = pd.DataFrame(df["Place"].unique(), columns=["Place"])
# Start stores index of df where the place appears 1st
uniquePlaces["Start"] = -1
# End stores index of df where the place appears last
uniquePlaces["End"] = -1
## adds new column "Place Label" which is label encoded value for a place
## "Place Label" may not be necessary but it may improve performance when looking up and merging
## this function also updates Start and End of current label in group
def assign_place_label(group):
label=uniquePlaces[uniquePlaces["Place"]==group.name].index[0]
group["Place Label"] = label
uniquePlaces.loc[label, "Start"] = group.index.min()
uniquePlaces.loc[label, "End"] = group.index.max()
return group
## based on Start and End of each place assign index to each place
## when a freed the index is reused to new place appearing after that
def get_dynamic_group(up):
up["Index"] = 0
up["Freed"] = False
max_ind=0
free_indx = []
for i in range(len(up)):
ind_freed = up.index[(up["End"]<up.iloc[i]["Start"]) & (~up["Freed"])]
free = list(up.loc[ind_freed, "Index"])
free_indx += free
up.loc[ind_freed, "Freed"] = True
if len(free_indx)>0:
m = min(free_indx)
up.loc[i, "Index"] = m
free_indx.remove(m)
else:
up.loc[i, "Index"] = max_ind
max_ind+=1
up["Group"] = up["Index"]//3+1
return up
df2 = df.groupby("Place").apply(assign_place_label)
uniquePlaces = get_dynamic_group(uniquePlaces)
display(uniquePlaces)
df3 = df2[df2.Period!=0].drop_duplicates(subset = ['Period','Place'])
result = df3.merge(uniquePlaces[["Group"]], how="left", left_on="Place Label",
right_index=True, sort=False)
display(result)
Output
Period Place Place Label Group
0 1 CLUB 0 1
1 2 CLUB 0 1
3 2 HOME 1 1
5 2 AWAY 2 1
7 3 WORK 3 2
9 3 AWAY 2 1
11 3 GOLF 4 2
13 4 CLUB 0 1
15 4 POOL 5 2
17 4 HOME 1 1
19 5 WORK 3 2
21 5 AWAY 2 1
23 5 POOL 5 2
25 6 TENNIS 6 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With