I have a scenario where I need to transform the values of a particular column based on the value present in another column in the same row, and the value in another dataframe.
Example-
print(parent_df)
school location modifed_date
0 school_1 New Delhi 2020-04-06
1 school_2 Kolkata 2020-04-06
2 school_3 Bengaluru 2020-04-06
3 school_4 Mumbai 2020-04-06
4 school_5 Chennai 2020-04-06
print(location_df)
school location
0 school_10 New Delhi
1 school_20 Kolkata
2 school_30 Bengaluru
3 school_40 Mumbai
4 school_50 Chennai
As per this use case, I need to transform the school names present in parent_df
, based on the location
column present in the same df, and the location property present in location_df
To achieve this transformation, I wrote the following method.
def transform_school_name(row, location_df):
name_alias = location_df[location_df['location'] == row['location']]
if len(name_alias) > 0:
return location_df.school.iloc[0]
else:
return row['school']
And this is how I am calling this method
parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)
The issue is that for just 46K records, I am seeing the entire tranformation happening in around 2 mins, which is too slow. How can I improve the performance of this solution?
Following is the actual scenario I am dealing with wherein there is a small tranformation that is needed to be done before we can replace the value in the original column. I am not sure if this can be done within replace()
method as mentioned in one of the answers below.
print(parent_df)
school location modifed_date type
0 school_1 _pre_New Delhi_post 2020-04-06 Govt
1 school_2 _pre_Kolkata_post 2020-04-06 Private
2 school_3 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
print(location_df)
school location type
0 school_10 New Delhi Govt
1 school_20 Kolkata Private
2 school_30 Bengaluru Private
Custom Method code
def transform_school_name(row, location_df):
location_values = row['location'].split('_')
name_alias = location_df[location_df['location'] == location_values[1]]
name_alias = name_alias[name_alias['type'] == location_df['type']]
if len(name_alias) > 0:
return location_df.school.iloc[0]
else:
return row['school']
def transform_school_name(row, location_df):
name_alias = location_df[location_df['location'] == row['location']]
if len(name_alias) > 0:
return location_df.school.iloc[0]
else:
return row['school']
This is the actual scenario what I need to handle, so using replace()
method won't help.
You can use map/replace
:
parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])
Output:
school location modifed_date
0 school_10 New Delhi 2020-04-06
1 school_20 Kolkata 2020-04-06
2 school_30 Bengaluru 2020-04-06
3 school_40 Mumbai 2020-04-06
4 school_50 Chennai 2020-04-06
IIUC, this is more of a regex issue as the pattern doesn't match exactly. First extract the required pattern, create mapping of location in parent_df to location_df, map the values.
pat = '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?'
mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])
parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df
school location modifed_date type
0 school_10 _pre_New Delhi_post 2020-04-06 Govt
1 school_20 _pre_Kolkata_post 2020-04-06 Private
2 school_30 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With