I'm familiar with Python's nltk.metrics.distance
module, which is commonly used to compute the edit distance of two strings.
I am interested in a function that computes such a distance not character-wise, as usual, but token-wise. By that I mean that you can only replace/add/delete whole tokens (instead of characters).
Example of regular edit distance and my desired tokenized version:
> char_dist("aa bbbb cc", "aa b cc")
3  # add the 'b' character three times
> token_dist("aa bbbb cc", "aa b cc")
1  # replace the 'bbbb' token with the 'b' token
Is there already a function in Python that can compute token_dist? I'd rather use something already implemented and tested than write my own piece of code. Thanks for any tips.
NLTK's edit_distance
appears to work just as well with lists as with strings:
nltk.edit_distance("aa bbbb cc", "aa b cc")
> 3
nltk.edit_distance("aa bbbb cc".split(), "aa b cc".split())
> 1
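If you'd rather avoid the dependency, the same token-level distance can be computed with the standard Levenshtein dynamic-programming recurrence applied to lists of tokens instead of characters. A minimal sketch (the function name `token_dist` is taken from the question; the implementation itself is just the textbook algorithm, not NLTK's code):

```python
def token_dist(s1, s2):
    """Token-wise Levenshtein distance: edits apply to whole tokens."""
    a, b = s1.split(), s2.split()
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        curr = [i]
        for j, tok_b in enumerate(b, 1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(token_dist("aa bbbb cc", "aa b cc"))  # → 1
```

Splitting on whitespace is an assumption; pass pre-tokenized lists instead if your tokens are defined differently.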
First, install the following:
pip install editdistance
Then the following will give you the token-wise edit distance:
import editdistance
editdistance.eval(list1, list2)
Example:
import editdistance
tokens1 = ['aa', 'bb', 'cc']
tokens2 = ['a' , 'bb', 'cc']
editdistance.eval(tokens1, tokens2)
Out[4]: 1
For more information, please refer to:
https://github.com/aflc/editdistance