
vectorize join condition in pandas

Tags: pandas

This code works as expected, but it takes a long time on large dataframes.

for i in excel_df['name_of_college_school'] :
    for y in mysql_df['college_name'] :
        if SequenceMatcher(None,  i.lower(), y.lower() ).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y

I guess I cannot use a function in a join clause to compare values like this. How do I vectorize it?


Update:

Is it possible to keep only the match with the highest score? This loop overwrites an earlier match, even though the earlier match may have been more relevant than the current one.

shantanuo asked Sep 18 '17 06:09


2 Answers

What you are looking for is fuzzy merging.

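from difflib import SequenceMatcher

# college_index_a, college_index_b and dupmark_index are integer column positions;
# the 'dupmark4' column must already exist in excel_df.
a = excel_df.to_numpy(dtype=object)  # as_matrix() was removed in pandas 1.0
b = mysql_df.to_numpy(dtype=object)
for i in a:
    for j in b:
        if SequenceMatcher(None,
               i[college_index_a].lower(), j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]
# to_numpy() may return a copy, so write the matches back to the frame.
excel_df['dupmark4'] = a[:, dupmark_index]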

Avoid .loc inside a loop; each call carries a lot of overhead. The arrays are indexed by position, so first look up the integer position of each column you need, for example:

college_index_b = mysql_df.columns.get_loc("college_name")
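The loop above still keeps whichever match happens to come last, which is what the update in the question asks about. A minimal sketch of one way to keep the highest-scoring match instead is shown below. The helper name `best_match` and the sample college names are made up for illustration; only `SequenceMatcher` and the 0.8 threshold come from the question.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, cutoff=0.8):
    """Return the candidate whose ratio is highest and above cutoff, else None."""
    best, best_score = None, cutoff
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:  # only a strictly better score replaces the current best
            best, best_score = cand, score
    return best


# Hypothetical values standing in for the two columns in the question.
excel_names = ["Stanford Univercity", "MIT", "Unknown School"]
mysql_names = ["Stanford University", "Stanford Universities", "MIT"]

matches = [best_match(name, mysql_names) for name in excel_names]
```

Here both "Stanford University" and "Stanford Universities" pass the threshold for the misspelled name, and the closer one wins regardless of order. With the real frames, the list could be built from `mysql_df['college_name'].tolist()` and the result assigned in one step to `excel_df['dupmark4']`, so `.loc` is never called inside the loop.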
siddharth iyer answered Oct 28 '22 09:10


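You could replace one of the loops with apply. Instead of up to M×N .loc operations there are then only M, one per value in mysql_df['college_name'], although SequenceMatcher still runs for every pair of names.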

for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
                                            None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
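If names repeat in the Excel column, the number of comparisons can be reduced further by matching each distinct name once and reusing the result. The sketch below is one possible way to do that with `difflib.get_close_matches`, which returns candidates sorted by similarity, so `n=1` gives the best one. The sample names and the `lookup` dictionary are assumptions for illustration. Note that `get_close_matches` accepts scores greater than or equal to the cutoff, whereas the question uses a strict `> 0.8`.

```python
from difflib import get_close_matches

# Hypothetical values standing in for the two columns in the question.
excel_names = ["Stanford Univercity", "MIT", "stanford univercity", "MIT"]
mysql_names = ["Stanford University", "Stanford Universities", "MIT"]

# Compare in lower case, but remember the original spelling of each candidate.
lowered = {name.lower(): name for name in mysql_names}

lookup = {}
for key in {name.lower() for name in excel_names}:  # each distinct name only once
    hits = get_close_matches(key, list(lowered), n=1, cutoff=0.8)
    lookup[key] = lowered[hits[0]] if hits else None

# Reuse the stored result for every occurrence of a name.
matches = [lookup[name.lower()] for name in excel_names]
```

With the frames from the question, the same dictionary could be applied in a single step with `excel_df['name_of_college_school'].str.lower().map(lookup)`.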
Zero answered Oct 28 '22 09:10