I am trying to use the TRE library in Python to match misspelled input. It is important that it handles UTF-8 encoded strings well.

An example: the German capital's name is Berlin, but from the pronunciation it would sound the same if people wrote "Bärlin".

It works so far, but if a non-ASCII character is at the first or second position of the matched string, neither the range nor the matched string itself is correct.
# -*- coding: utf-8 -*-
import tre

def apro_match(word, lst):
    fz = tre.Fuzzyness(maxerr=3)
    pt = tre.compile(word)
    for i in lst:
        m = pt.search(i, fz)
        if m:
            print m.groups()[0], ' ', m[0]
if __name__ == '__main__':
    string1 = u'Berlín'.encode('utf-8')
    string2 = u'Bärlin'.encode('utf-8')
    string3 = u'B\xe4rlin'.encode('utf-8')
    string4 = u'Berlän'.encode('utf-8')
    string5 = u'London, Paris, Bärlin'.encode('utf-8')
    string6 = u'äerlin'.encode('utf-8')
    string7 = u'Beälin'.encode('utf-8')
    l = ['Moskau', string1, string2, string3, string4, string5, string6, string7]
    print '\n'*2
    print "apro_match('Berlin', l)"
    print "="*20
    apro_match('Berlin', l)
    print '\n'*2
    print "apro_match('.*Berlin', l)"
    print "="*20
    apro_match('.*Berlin', l)
Output:
apro_match('Berlin', l)
====================
(0, 7) Berlín
(1, 7) ärlin
(1, 7) ärlin
(0, 7) Berlän
(16, 22) ärlin
(1, 7) ?erlin
(0, 7) Beälin
apro_match('.*Berlin', l)
====================
(0, 7) Berlín
(0, 7) Bärlin
(0, 7) Bärlin
(0, 7) Berlän
(0, 22) London, Paris, Bärlin
(0, 7) äerlin
(0, 7) Beälin
Note that for the regex '.*Berlin'
it works fine, while for the regex 'Berlin'
u'Bärlin'.encode('utf-8')
u'B\xe4rlin'.encode('utf-8')
u'äerlin'.encode('utf-8')
do not work, while
u'Berlín'.encode('utf-8')
u'Berlän'.encode('utf-8')
u'London, Paris, Bärlin'.encode('utf-8')
u'Beälin'.encode('utf-8')
work as expected.
Is there something I am doing wrong with the encoding? Do you know any trick?
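The length discrepancy between a Unicode string and its UTF-8 encoding hints at the cause: 'ä' occupies two bytes in UTF-8, so byte positions and character positions diverge as soon as a multi-byte character appears at or before the match. A minimal demonstration, using nothing but the standard library:

```python
# -*- coding: utf-8 -*-
# In UTF-8, non-ASCII characters take more than one byte, so byte
# offsets (what TRE reports) differ from character offsets.
s = u'Bärlin'
encoded = s.encode('utf-8')

print(len(s))        # 6 characters
print(len(encoded))  # 7 bytes: 'ä' alone encodes to two bytes

# Byte index 1 holds only the first half of 'ä'; a match range that
# starts there cuts the character in two, yielding mangled text.
print(encoded[1:2])
```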
You could use the new regex library; it supports Unicode 6.0 and fuzzy matching:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from itertools import ifilter, imap
import regex as re

def apro_match(word_re, lines, fuzzy='e<=1'):
    search = re.compile(ur'('+word_re+'){'+fuzzy+'}').search
    for m in ifilter(None, imap(search, lines)):
        print m.span(), m[0]

def main():
    lst = u'Moskau Berlín Bärlin B\xe4rlin Berlän'.split()
    lst += [u'London, Paris, Bärlin']
    lst += u'äerlin Beälin'.split()
    print
    print "apro_match('Berlin', lst)"
    print "="*25
    apro_match('Berlin', lst)
    print
    print "apro_match('.*Berlin', lst)"
    print "="*27
    apro_match('.*Berlin', lst)

if __name__ == '__main__':
    main()
'e<=1'
means that at most one error of any kind is permitted. There are three types of errors: insertion ('i'), deletion ('d'), and substitution ('s').
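These three error types are exactly the edit operations counted by Levenshtein distance. As an illustration of what the error count measures (this is not the regex module's actual implementation, just a library-free sketch):

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (Wagner-Fischer DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

# 'Bärlin' is one substitution away from 'Berlin', so it matches
# with fuzzy='e<=1'; a string with two errors would not.
print(edit_distance(u'Bärlin', u'Berlin'))  # 1
print(edit_distance(u'Bxrlxn', u'Berlin'))  # 2
```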
apro_match('Berlin', lst)
=========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(15, 21) Bärlin
(0, 6) äerlin
(0, 6) Beälin
apro_match('.*Berlin', lst)
===========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(0, 21) London, Paris, Bärlin
(0, 6) äerlin
(0, 6) Beälin
Internally TRE works at the byte level and returns byte positions. I ran into the same issue a while ago; there is no trick!
I modified the Python bindings, adding a utf8 function, a function which builds a map from byte positions to character positions, and a small wrapper. Your test case works as expected when using this wrapper. I have not released the modifications; it was more of a quick hack while testing TRE, but if you want them, just let me know.
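The wrapper itself is not shown, but the byte-to-character map it describes could look something like this (an assumption about the approach, not the actual patch):

```python
# -*- coding: utf-8 -*-
def byte_to_char_map(text):
    """Map each byte offset in the UTF-8 encoding of `text` to the
    index of the character that byte belongs to, plus one entry for
    the end-of-string position."""
    mapping = {}
    byte_pos = 0
    for char_pos, ch in enumerate(text):
        # A multi-byte character claims several consecutive byte offsets.
        for _ in range(len(ch.encode('utf-8'))):
            mapping[byte_pos] = char_pos
            byte_pos += 1
    mapping[byte_pos] = len(text)
    return mapping

# If TRE reports the byte range (1, 7) inside u'Bärlin', translating
# it back gives the character range (1, 6), i.e. u'ärlin'.
m = byte_to_char_map(u'Bärlin')
print(m[1], m[7])  # 1 6
```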
AFAIK TRE hasn't been updated in quite a while, and there are still unfixed bugs in the current release (0.8.0) relating to pattern matching towards the end of a string (e.g. searching "2004 " with the pattern "2004$" gives a cost of 2, while the expected cost is 1).
As others have pointed out, for Python the new regex module seems quite interesting!