Create groups/classes based on conditions within columns

Tags: python, pandas

I need help transforming my data so that I can analyze the transaction data.

Business Case

I'm trying to group related transactions together into classes of events. This data set represents workers going out on various leave-of-absence events. I want any transaction that falls within 365 days of the previous transaction for the same worker to belong to the same leave event class. For charting trends, I want to number the classes so I get a sequence/pattern.

My code allows me to see when the very first event occurred, and it can identify when a new class starts, but it doesn't bucket each transaction into a class.

Requirements:

  • Tag all rows based on what leave class they fall into.
  • Number each Unique Leave Event. Using this example, index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 2 would be Unique Leave Event 2, and index 3 would be Unique Leave Event 1, etc.

I added a column for the desired output, labeled "Desired Output". Note that there can be many more rows/events per person, and there can be many more people.
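To make the 365-day rule concrete, here is a small illustrative check of the day gaps for employee 100 (the `dates` variable below is just for illustration, not part of the real data set); a gap of more than 365 days from the prior row starts a new class:

import pandas as pd

# Day gaps between consecutive leave dates for employee 100 (from the sample data below)
dates = pd.Series(pd.to_datetime(["2013-01-01", "2014-07-01", "2015-06-05", "2016-01-01"]))
print(dates.diff().dt.days.tolist())   # [nan, 546.0, 339.0, 210.0]
# 546 > 365, so 2014-07-01 starts a new class (Unique Leave Event 2);
# 339 and 210 are within 365 days of the prior row, so those rows stay in Event 2.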

Some Data

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
        'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
        'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])

Some Code I've Tried

df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
df['Prior Row in Same Emplid Class'] = "No"

# Difference in whole days between the next row's date and this row's date
df['Effdt Diff'] = (df['Effdt-Shift'] - df['Effective Date']).dt.days

df['Cumul. Count'] = df.groupby('Employee ID').cumcount()
df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')

# Use .loc for the conditional assignments to avoid chained-indexing warnings
df['First Row Appears?'] = ""
df.loc[df['Cumul. Count'] == df['Groupby'], 'First Row Appears?'] = "First Row"
df.loc[df['Employee ID'] == df['EmplidShift'], 'Prior Row in Same Emplid Class'] = "Yes"

df['Effdt > 1 Yr?'] = ""
df.loc[(df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365), 'Effdt > 1 Yr?'] = "Yes"

df['Unique Leave Event'] = ""
df.loc[(df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row"), 'Unique Leave Event'] = "Unique Leave Event"

df
asked Sep 26 '16 by Christopher



1 Answer

You can do this without looping or iterating through your DataFrame. As Wes McKinney describes, you can use .apply() with a GroupBy object and define a function to apply to each group. Combined with .shift() (like here), you can get your result without any loops.

Terse example:

# Group by Employee ID
grouped = df.groupby("Employee ID")
# Define function 
def get_unique_events(group):
    # Convert to date and sort by date, like @Khris did
    group["Effective Date"] = pd.to_datetime(group["Effective Date"])
    group = group.sort_values("Effective Date")
    event_series = (group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')).apply(lambda x: int(x)).cumsum()+1
    return event_series

event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0)
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True)
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x))
df['Match'] = df['Desired Output'] == df['Output']

print(df)

Output:

  Employee ID Effective Date        Desired Output  Unique Event  \
3         100     2013-01-01  Unique Leave Event 1             1
2         100     2014-07-01  Unique Leave Event 2             2
1         100     2015-06-05  Unique Leave Event 2             2
0         100     2016-01-01  Unique Leave Event 2             2
6         200     2013-01-01  Unique Leave Event 1             1
5         200     2015-01-01  Unique Leave Event 2             2
4         200     2016-01-01  Unique Leave Event 2             2
7         300        2014-01  Unique Leave Event 1             1

                 Output Match
3  Unique Leave Event 1  True
2  Unique Leave Event 2  True
1  Unique Leave Event 2  True
0  Unique Leave Event 2  True
6  Unique Leave Event 1  True
5  Unique Leave Event 2  True
4  Unique Leave Event 2  True
7  Unique Leave Event 1  True

More verbose example for clarity:

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
        'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
        'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])

# Group by Employee ID
grouped = df.groupby("Employee ID")

# Define a function to get the unique events
def get_unique_events(group):
    # Convert to date and sort by date, like @Khris did
    group["Effective Date"] = pd.to_datetime(group["Effective Date"])
    group = group.sort_values("Effective Date")
    # Define a series of booleans to determine whether the time between dates is over 365 days
    # Use .shift(1) to look back one row
    is_year = group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')
    # Convert booleans to integers (0 for False, 1 for True)
    is_year_int = is_year.apply(lambda x: int(x))    
    # Use the cumulative sum function in pandas to get the cumulative adjustment from the first date.
    # Add one to start the first event as 1 instead of 0
    event_series = is_year_int.cumsum() + 1
    return event_series

# Run function on df and put results into a new dataframe
# Convert Employee ID back from an index to a column with .reset_index(level=0)
event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0)

# Merge the dataframes
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True)

# Add string to match desired format
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x))

# Check to see if output matches desired output
df['Match'] = df['Desired Output'] == df['Output']

print(df)

You get the same output as shown above.
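As a side note on the design, the same 365-day rule can also be expressed without passing a custom function to .apply(), by sorting and then using a per-group diff plus a cumulative sum. This is only a sketch of an equivalent approach (the 'Unique Event' and 'Output' column names simply mirror the ones above), not the answer's own method:

# Sketch: vectorized alternative, starting from the original sample df built from `data` above
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df = df.sort_values(['Employee ID', 'Effective Date'])
# Gap to the previous leave date within each employee
gap = df.groupby('Employee ID')['Effective Date'].diff()
# A gap of more than 365 days starts a new event; the cumulative sum numbers events per employee
df['Unique Event'] = (gap > pd.Timedelta(days=365)).astype(int).groupby(df['Employee ID']).cumsum() + 1
df['Output'] = 'Unique Leave Event ' + df['Unique Event'].astype(str)

For the sample data this produces the same numbering as the 'Desired Output' column.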
answered Sep 28 '22 by Andrew