
Edit distance between two pandas columns

I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.

from nltk.metrics import edit_distance    
df['edit'] = edit_distance(df['column1'], df['column2'])

For some reason this seems to enter some sort of infinite loop: it remains unresponsive for quite some time, and I eventually have to terminate it manually.

Any suggestions are welcome.

Orest Xherija asked Mar 19 '17 21:03




1 Answer

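NLTK's edit_distance function compares a single pair of strings. When you pass it two whole columns, it treats each column as one long sequence and compares every element of one against every element of the other, so the work grows with the product of the column lengths; that is most likely why it appears to hang. If you want the edit distance between corresponding pairs of strings, apply it separately to each row's strings like this: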

results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)

Or like this (probably a little more efficient), to avoid including the irrelevant columns of the dataframe:

results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)

To add the results to your dataframe, you'd use it like this:

df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
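
For illustration, here is a minimal, self-contained sketch of the same row-by-row idea that runs without pandas or NLTK. The levenshtein function is a hypothetical stand-in for edit_distance (plain insert/delete/substitute costs of 1), and the two lists are made-up sample data standing in for df["column1"] and df["column2"]; neither comes from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical sample data standing in for the two DataFrame columns.
column1 = ["kitten", "flaw", "pandas"]
column2 = ["sitting", "lawn", "pandas"]

# One distance per pair of corresponding strings, never across rows.
distances = [levenshtein(a, b) for a, b in zip(column1, column2)]
print(distances)  # [3, 2, 0]
```

With a real DataFrame the same pairing can be written as df["distance"] = [edit_distance(a, b) for a, b in zip(df["column1"], df["column2"])], which may be somewhat faster than apply with axis=1 on large frames, though that is worth measuring on your own data.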
alexis answered Oct 09 '22 09:10
