
Token-based edit distance in Python?

I'm familiar with Python's nltk.metrics.distance module, which is commonly used to compute the edit distance between two strings.

I am interested in a function that computes this distance token-wise rather than character-wise, as is usual. By that I mean that only whole tokens (instead of single characters) can be replaced, added, or deleted.

Example of regular edit distance and my desired tokenized version:

> char_dist("aa bbbb cc",
            "aa b cc")
3                              # delete the 'b' character three times

> token_dist("aa bbbb cc",
             "aa b cc")
1                              # replace 'bbbb' token with 'b' token

Is there an existing function that can compute token_dist in Python? I'd rather use something already implemented and tested than write my own code. Thanks for any tips.

petrbel asked Dec 01 '22 16:12


2 Answers

NLTK's edit_distance appears to work just as well with lists as with strings:

import nltk

nltk.edit_distance("aa bbbb cc", "aa b cc")
# 3 (character-level)
nltk.edit_distance("aa bbbb cc".split(), "aa b cc".split())
# 1 (token-level)
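For anyone wondering why passing lists works: the usual Levenshtein dynamic program only ever compares elements for equality, so it does not care whether those elements are characters or whole tokens. Below is a minimal pure-Python sketch of that idea, not NLTK's actual implementation. The names edit_distance and token_dist here are my own hypothetical helpers, and splitting on whitespace is an assumption about what counts as a token.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings, lists, tuples)."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute (free if the items are equal)
            ))
        prev = curr
    return prev[-1]


def token_dist(s1, s2):
    """Hypothetical helper: split on whitespace, then compare token lists."""
    return edit_distance(s1.split(), s2.split())


print(edit_distance("aa bbbb cc", "aa b cc"))  # 3
print(token_dist("aa bbbb cc", "aa b cc"))     # 1
```

If your text contains punctuation, you would probably want a real tokenizer instead of str.split(), since "cc." and "cc" count as different tokens here.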
dadamson answered Jan 12 '23 00:01


First, install the editdistance package:

pip install editdistance

Then the following gives you the token-wise edit distance, where list1 and list2 are lists of tokens:

import editdistance
editdistance.eval(list1, list2)

Example:

import editdistance
tokens1 = ['aa', 'bb', 'cc']
tokens2 = ['a' , 'bb', 'cc']
editdistance.eval(tokens1, tokens2)
# 1

For more information, see:

https://github.com/aflc/editdistance

CentAu answered Jan 12 '23 00:01