vectorize join condition in pandas

Question

This code works as expected, but it is very slow on large dataframes.

from difflib import SequenceMatcher

for i in excel_df['name_of_college_school'] :
    for y in mysql_df['college_name'] :
        if SequenceMatcher(None,  i.lower(), y.lower() ).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y

I guess I cannot use a function in a join clause to compare values like this. How do I vectorize this?

Update:

Is it possible to keep only the match with the highest score? This loop overwrites an earlier match, even though the earlier match may have been more relevant than the current one.

siddharth iyer · Accepted Answer

What you are looking for is fuzzy merging.

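a = excel_df.to_numpy()  # as_matrix() was removed in pandas 1.0
b = mysql_df.to_numpy()
for i in a:
    for j in b:
        if SequenceMatcher(None,
               i[college_index_a].lower(), j[college_index_b].lower() ).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]  # store the matched name, not the whole row
# to_numpy() may return a copy, so write the results back explicitly
excel_df['dupmark4'] = a[:, dupmark_index]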

Never use loc inside a loop; it has a huge overhead. Also, look up the numerical (positional) index of each column you need, like this:

df.columns.get_loc("college_name")
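As a rough, self-contained sketch of the idea above (not necessarily the exact code the answer had in mind), the column positions can be looked up once with get_loc, the matches collected in a plain list, and the new column assigned in a single step. The two small frames are hypothetical sample data; only the column names come from the question.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
excel_df = pd.DataFrame(
    {"name_of_college_school": ["Fergusson College", "Modern College Pune"]}
)
mysql_df = pd.DataFrame({"college_name": ["fergusson college", "Wadia College"]})

# Positional index of each column that is compared.
college_index_a = excel_df.columns.get_loc("name_of_college_school")
college_index_b = mysql_df.columns.get_loc("college_name")

a = excel_df.to_numpy()
b = mysql_df.to_numpy()

matches = []
for i in a:
    found = None
    for j in b:
        score = SequenceMatcher(
            None, i[college_index_a].lower(), j[college_index_b].lower()
        ).ratio()
        if score > 0.8:
            # As in the question, a later match overwrites an earlier one.
            found = j[college_index_b]
    matches.append(found)

# One column assignment instead of a .loc call per match.
excel_df["dupmark4"] = matches
```

This is still an M×N comparison; it only removes the per-match .loc overhead.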

Zero · Answer

You can eliminate one of the loops with apply. Instead of up to M×N .loc operations there are now only M, one per value in mysql_df['college_name'].

for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
                                            None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
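For the update in the question (keeping the highest-scoring match rather than the last one), one possible approach is to apply a helper that tracks the best score per row. This is a minimal sketch under assumptions: best_match and its threshold parameter are hypothetical names introduced here, and the sample frames are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
excel_df = pd.DataFrame(
    {"name_of_college_school": ["Fergusson College", "Modern College Pune"]}
)
mysql_df = pd.DataFrame(
    {"college_name": ["Fergusson Colege", "fergusson college", "Wadia College"]}
)

# Compare against each distinct candidate only once.
candidates = mysql_df["college_name"].dropna().unique()


def best_match(name, threshold=0.8):
    """Return the candidate with the highest ratio above threshold, else None."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best


# One apply call; each row keeps its best-scoring candidate.
excel_df["dupmark4"] = excel_df["name_of_college_school"].apply(best_match)
```

The work is still one SequenceMatcher call per pair of names, so this changes which match is kept, not the overall cost.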
