I need help transforming my data so I can group related transactions together.
Business Case
I'm trying to group together some related transactions to create some groups or classes of events. This data set represents workers going out on various leaves of absence events. I want to create one class of leaves based on any transaction falling within 365 days of the leave event class. For charting trends, I want to number the classes so I get a sequence/pattern.
My code allows me to see when the very first event occurred, and it can identify when a new class starts, but it doesn't bucket each transaction into a class.
Requirements:
I added a column for the desired output, labeled "Desired Output". Note that there can be many more rows/events per person, and there can be many more people.
Some Data
import pandas as pd
data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])
Some Code I've Tried
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
df['Prior Row in Same Emplid Class'] = "No"
df['Effdt Diff'] = (df['Effdt-Shift'] - df['Effective Date']).dt.days
df['Cumul. Count'] = df.groupby('Employee ID').cumcount()
df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')
df['First Row Appears?'] = ""
df.loc[df['Cumul. Count'] == df['Groupby'], 'First Row Appears?'] = "First Row"
df.loc[df['Employee ID'] == df['EmplidShift'], 'Prior Row in Same Emplid Class'] = "Yes"
df['Effdt > 1 Yr?'] = ""
df.loc[(df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365), 'Effdt > 1 Yr?'] = "Yes"
df['Unique Leave Event'] = ""
df.loc[(df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row"), 'Unique Leave Event'] = "Unique Leave Event"
df
You can do this without looping or iterating through your dataframe. Per Wes McKinney, you can use .apply() on a groupby object, defining a function that is applied to each group. Combined with .shift(), this gets you the result without any loops.
Terse example:
# Group by Employee ID
grouped = df.groupby("Employee ID")
# Define function
def get_unique_events(group):
    # Convert to datetime and sort by date, like @Khris did
    group["Effective Date"] = pd.to_datetime(group["Effective Date"])
    group = group.sort_values("Effective Date")
    # A new event starts whenever the gap to the previous row exceeds 365 days
    event_series = (group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')).astype(int).cumsum() + 1
    return event_series
event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0)
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True)
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x))
df['Match'] = df['Desired Output'] == df['Output']
print(df)
Output:
Employee ID Effective Date Desired Output Unique Event \
3 100 2013-01-01 Unique Leave Event 1 1
2 100 2014-07-01 Unique Leave Event 2 2
1 100 2015-06-05 Unique Leave Event 2 2
0 100 2016-01-01 Unique Leave Event 2 2
6 200 2013-01-01 Unique Leave Event 1 1
5 200 2015-01-01 Unique Leave Event 2 2
4 200 2016-01-01 Unique Leave Event 2 2
7 300 2014-01 Unique Leave Event 1 1
Output Match
3 Unique Leave Event 1 True
2 Unique Leave Event 2 True
1 Unique Leave Event 2 True
0 Unique Leave Event 2 True
6 Unique Leave Event 1 True
5 Unique Leave Event 2 True
4 Unique Leave Event 2 True
7 Unique Leave Event 1 True
More verbose example for clarity:
import pandas as pd
data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])
# Group by Employee ID
grouped = df.groupby("Employee ID")
# Define a function to get the unique events
def get_unique_events(group):
    # Convert to datetime and sort by date, like @Khris did
    group["Effective Date"] = pd.to_datetime(group["Effective Date"])
    group = group.sort_values("Effective Date")
    # Define a series of booleans to determine whether the time between dates is over 365 days
    # Use .shift(1) to look back one row
    is_year = group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')
    # Convert booleans to integers (0 for False, 1 for True)
    is_year_int = is_year.astype(int)
    # Use the cumulative sum function in pandas to get the cumulative adjustment from the first date.
    # Add one to start the first event as 1 instead of 0
    event_series = is_year_int.cumsum() + 1
    return event_series
# Run function on df and put results into a new dataframe
# Convert Employee ID back from an index to a column with .reset_index(level=0)
event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0)
# Merge the dataframes
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True)
# Add string to match desired format
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x))
# Check to see if output matches desired output
df['Match'] = df['Desired Output'] == df['Output']
print(df)
You get the same output:
Employee ID Effective Date Desired Output Unique Event \
3 100 2013-01-01 Unique Leave Event 1 1
2 100 2014-07-01 Unique Leave Event 2 2
1 100 2015-06-05 Unique Leave Event 2 2
0 100 2016-01-01 Unique Leave Event 2 2
6 200 2013-01-01 Unique Leave Event 1 1
5 200 2015-01-01 Unique Leave Event 2 2
4 200 2016-01-01 Unique Leave Event 2 2
7 300 2014-01 Unique Leave Event 1 1
Output Match
3 Unique Leave Event 1 True
2 Unique Leave Event 2 True
1 Unique Leave Event 2 True
0 Unique Leave Event 2 True
6 Unique Leave Event 1 True
5 Unique Leave Event 2 True
4 Unique Leave Event 2 True
7 Unique Leave Event 1 True
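Since the business case mentions charting trends, once each row carries its event number you can aggregate however you like. For instance, counting distinct employees per event number (a sketch; the 'Unique Event' values below are the ones produced for the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    'Employee ID': ["100", "100", "100", "100", "200", "200", "200", "300"],
    'Unique Event': [2, 2, 2, 1, 2, 2, 1, 1],
})
# Distinct employees that reached each event number
trend = df.groupby('Unique Event')['Employee ID'].nunique()
print(trend)
```

Here all three employees have an event 1, while only employees 100 and 200 reach an event 2.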