Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Allocate values from different options - pandas

Tags:

python

pandas

I have a df that contains numerous Places at recurring time periods. These Places are beginning and finishing in random fashion. For each time period, I want to assign each unique place to a Group. The central rules in doing this are:

1) Each Group can hold no more than 3 unique Places at any one time

2) Unique Places should be evenly distributed across each Group

I've taken a very small subsection of the df. There are 7 unique values (but no more than 5 occuring at any one time) and 2 Groups to choose from. But in practice, the df could contain up to 50 unique values in total that finish and end and varying time periods that will be distributed across a maximum of 6 Groups.

To understand how many Places are currently occuring I've included a Total, which is based on if the Place appears again.

The df contains all available Groups for each unique Place at each Period. Places Golf and Club will finish but we assume all other places are continued as they appear later in the df.

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6],  
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS'],                                
    'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5],                            
    'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1'],                           
    })

The main issue that's causing me trouble is Places appear/exist dynamically. In that, they finish and new ones begin in a random fashion. So assigning and distributing the current unique Places needs to account for this concept

Attempt:

def AssignPlace(df):
        uniquePlaces = df['Place'].unique()
        G3 = dict(zip(uniquePlaces, np.arange(len(uniquePlaces)) // 3 + 1))
        df['Assigned Group'] = df['Place'].map(G3)
        return df

df = df.groupby('Available Group', sort=False).apply(AssignPlace)
df = df.drop_duplicates(subset = ['Period','Place'])

Out:

    Period   Place  Total Available Group  Assigned Group
0   1       CLUB    1      1               1             
1   2       CLUB    1      2               1             
3   2       HOME    2      2               1             
5   2       AWAY    3      2               1             
7   3       WORK    4      2               2             
9   3       AWAY    4      1               1             
11  3       GOLF    5      1               2  #GOLF FINISHES SO 4 OCCURING FROM NEXT ROW            
13  4       CLUB    4      2               1  #CLUB FINISHES BUT POOL STARTS SO STILL 4 OCCURING FROM NEXT ROW           
15  4       POOL    4      2               2             
17  4       HOME    4      2               1             
19  5       WORK    4      2               2             
21  5       AWAY    4      1               1             
23  5       POOL    4      1               2             
25  6       TENNIS  5      2               3  #Signifies issue

The last row displays the start of the issue. The assigned group measures this place correctly as the 7th unique value but it doesn't account for current unique values. As Club and Golf finish, they are only 5 current unqiue values and 2 available groups. But it's returning Group 3. So each new unique value will continue to be counted, rather than account for currently occuring unique values.

Intended Output, TENNIS Assigned Group is now 1, instead of 3:

    Period   Place  Total Available Group  Assigned Group
0   1       CLUB    1      1               1             
1   2       CLUB    1      2               1             
3   2       HOME    2      2               1             
5   2       AWAY    3      2               1             
7   3       WORK    4      2               2             
9   3       AWAY    4      1               1             
11  3       GOLF    5      1               2             
13  4       CLUB    4      2               1             
15  4       POOL    4      2               2             
17  4       HOME    4      2               1             
19  5       WORK    4      2               2             
21  5       AWAY    4      1               1             
23  5       POOL    4      1               2             
25  6       TENNIS  5      2               1 
like image 435
jonboy Avatar asked Mar 04 '23 09:03

jonboy


1 Answers

Here is my try. Explanation are on code comment, if not enough leave me a comment here

NOTE: I have added 5 dummy rows at bottom to simulate that those places will appear latter in df. So please ignore rows with period=0

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,0,0,0,0,0],  
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS', "AWAY","HOME","POOL","WORK", "TENNIS"],                                
#     'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5,0,0,0,0,0],                            
#     'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1',0,0,0,0,0],                           
    })

# df to store all unique places
uniquePlaces = pd.DataFrame(df["Place"].unique(), columns=["Place"])
# Start stores index of df where the place appears 1st
uniquePlaces["Start"] = -1
# End stores index of df where the place appears last 
uniquePlaces["End"] = -1

## adds new column "Place Label" which is label encoded value for a place
## "Place Label" may not be necessary but it may improve performance when looking up and merging
## this function also updates Start and End of current label in group
def assign_place_label(group):
    label=uniquePlaces[uniquePlaces["Place"]==group.name].index[0]
    group["Place Label"] = label
    uniquePlaces.loc[label, "Start"] = group.index.min()
    uniquePlaces.loc[label, "End"] = group.index.max()
    return group

## based on Start and End of each place assign index to each place
## when a freed the index is reused to new place appearing after that
def get_dynamic_group(up):
    up["Index"] = 0
    up["Freed"] = False
    max_ind=0
    free_indx = []
    for i in range(len(up)):
        ind_freed = up.index[(up["End"]<up.iloc[i]["Start"]) & (~up["Freed"])]
        free = list(up.loc[ind_freed, "Index"])
        free_indx += free

        up.loc[ind_freed, "Freed"] = True


        if len(free_indx)>0:
            m = min(free_indx)
            up.loc[i, "Index"] = m
            free_indx.remove(m)

        else:
            up.loc[i, "Index"] = max_ind
            max_ind+=1

    up["Group"] = up["Index"]//3+1

    return up  

df2 = df.groupby("Place").apply(assign_place_label)
uniquePlaces = get_dynamic_group(uniquePlaces)

display(uniquePlaces)

df3 = df2[df2.Period!=0].drop_duplicates(subset = ['Period','Place'])
result = df3.merge(uniquePlaces[["Group"]], how="left", left_on="Place Label", 
                   right_index=True, sort=False)
display(result)

Output

    Period  Place   Place Label Group
0   1   CLUB    0   1
1   2   CLUB    0   1
3   2   HOME    1   1
5   2   AWAY    2   1
7   3   WORK    3   2
9   3   AWAY    2   1
11  3   GOLF    4   2
13  4   CLUB    0   1
15  4   POOL    5   2
17  4   HOME    1   1
19  5   WORK    3   2
21  5   AWAY    2   1
23  5   POOL    5   2
25  6   TENNIS  6   1
like image 136
Dev Khadka Avatar answered Mar 12 '23 21:03

Dev Khadka