Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - Improve Performance of apply method

I have a scenario where I need to transform the values of a particular column based on the value present in another column in the same row, and the value in another dataframe.

Example-

print(parent_df)
       school         location      modifed_date
0      school_1       New Delhi     2020-04-06
1      school_2       Kolkata       2020-04-06
2      school_3       Bengaluru     2020-04-06
3      school_4       Mumbai        2020-04-06
4      school_5       Chennai       2020-04-06

print(location_df)
       school          location     
0      school_10       New Delhi
1      school_20       Kolkata     
2      school_30       Bengaluru
3      school_40       Mumbai       
4      school_50       Chennai

As per this use case, I need to transform the school names present in parent_df, based on the location column present in the same df, and the location property present in location_df

To achieve this transformation, I wrote the following method.

def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']

And this is how I am calling this method

parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)

The issue is that for just 46K records, I am seeing the entire tranformation happening in around 2 mins, which is too slow. How can I improve the performance of this solution?

EDITED

Following is the actual scenario I am dealing with wherein there is a small tranformation that is needed to be done before we can replace the value in the original column. I am not sure if this can be done within replace() method as mentioned in one of the answers below.

print(parent_df)
       school         location                  modifed_date    type
0      school_1       _pre_New Delhi_post       2020-04-06      Govt
1      school_2       _pre_Kolkata_post         2020-04-06      Private
2      school_3       _pre_Bengaluru_post       2020-04-06      Private
3      school_4       _pre_Mumbai_post          2020-04-06      Govt
4      school_5       _pre_Chennai_post         2020-04-06      Private

print(location_df)
           school          location     type
    0      school_10       New Delhi    Govt
    1      school_20       Kolkata      Private
    2      school_30       Bengaluru    Private

Custom Method code

def transform_school_name(row, location_df):
location_values = row['location'].split('_')
name_alias = location_df[location_df['location'] == location_values[1]]
name_alias = name_alias[name_alias['type'] == location_df['type']]
if len(name_alias) > 0:
    return location_df.school.iloc[0]
else:
    return row['school']


def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']

This is the actual scenario what I need to handle, so using replace() method won't help.

like image 756
Mitaksh Gupta Avatar asked Jan 25 '23 01:01

Mitaksh Gupta


2 Answers

You can use map/replace:

parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])

Output:

      school   location modifed_date
0  school_10  New Delhi   2020-04-06
1  school_20    Kolkata   2020-04-06
2  school_30  Bengaluru   2020-04-06
3  school_40     Mumbai   2020-04-06
4  school_50    Chennai   2020-04-06
like image 187
Quang Hoang Avatar answered Jan 27 '23 15:01

Quang Hoang


IIUC, this is more of a regex issue as the pattern doesn't match exactly. First extract the required pattern, create mapping of location in parent_df to location_df, map the values.

pat =  '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?' 

mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])

parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df


    school      location            modifed_date    type
0   school_10   _pre_New Delhi_post 2020-04-06      Govt
1   school_20   _pre_Kolkata_post   2020-04-06      Private
2   school_30   _pre_Bengaluru_post 2020-04-06      Private
3   school_4    _pre_Mumbai_post    2020-04-06      Govt
4   school_5    _pre_Chennai_post   2020-04-06      Private
like image 43
Vaishali Avatar answered Jan 27 '23 13:01

Vaishali