Approximate regex in Python with TRE: strange Unicode behavior

I am trying to use the TRE library in Python to match misspelled input.
It is important that it handles UTF-8 encoded strings well.

An example:
The German capital's name is Berlin, but it is pronounced the same as if people wrote "Bärlin".

It works so far, but if a non-ASCII character is in the first or second position of the detected string, neither the range nor the detected string itself is correct.

# -*- coding: utf-8 -*-
import tre

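def apro_match(word, lines):
    # allow at most 3 errors per match
    fz = tre.Fuzzyness(maxerr=3)
    pt = tre.compile(word)
    for line in lines:
        m = pt.search(line, fz)
        if m:
            # m.groups()[0] is the matched range, m[0] the matched text
            print m.groups()[0],' ', m[0]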

if __name__ == '__main__':
    string1 = u'Berlín'.encode('utf-8')
    string2 = u'Bärlin'.encode('utf-8')    
    string3 = u'B\xe4rlin'.encode('utf-8')
    string4 = u'Berlän'.encode('utf-8')
    string5 = u'London, Paris, Bärlin'.encode('utf-8')
    string6 = u'äerlin'.encode('utf-8')
    string7 = u'Beälin'.encode('utf-8')

    l = ['Moskau', string1, string2, string3, string4, string5, string6, string7]

    print '\n'*2
    print "apro_match('Berlin', l)"
    print "="*20
    apro_match('Berlin', l)
    print '\n'*2

    print "apro_match('.*Berlin', l)"
    print "="*20
    apro_match('.*Berlin', l)

Output

apro_match('Berlin', l)
====================
(0, 7)   Berlín
(1, 7)   ärlin
(1, 7)   ärlin
(0, 7)   Berlän
(16, 22)   ärlin
(1, 7)   ?erlin
(0, 7)   Beälin



apro_match('.*Berlin', l)
====================
(0, 7)   Berlín
(0, 7)   Bärlin
(0, 7)   Bärlin
(0, 7)   Berlän
(0, 22)   London, Paris, Bärlin
(0, 7)   äerlin
(0, 7)   Beälin

Note that for the regex '.*Berlin' it works fine, while for the regex 'Berlin'

u'Bärlin'.encode('utf-8')    
u'B\xe4rlin'.encode('utf-8')
u'äerlin'.encode('utf-8')

are not working, while

u'Berlín'.encode('utf-8')
u'Berlän'.encode('utf-8')
u'London, Paris, Bärlin'.encode('utf-8')
u'Beälin'.encode('utf-8')

work as expected.

Am I doing something wrong with the encoding? Do you know of any trick?

asked Aug 04 '11 18:08

vikingosegundo

2 Answers

You could use the new regex library; it supports Unicode 6.0 and fuzzy matching:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from itertools import ifilter, imap
import regex as re

def apro_match(word_re, lines, fuzzy='e<=1'):
    search = re.compile(ur'('+word_re+'){'+fuzzy+'}').search
    for m in ifilter(None, imap(search, lines)):
        print m.span(), m[0]

def main():
    lst = u'Moskau Berlín Bärlin B\xe4rlin Berlän'.split()
    lst += [u'London, Paris, Bärlin']
    lst += u'äerlin Beälin'.split()
    print
    print "apro_match('Berlin', lst)"
    print "="*25
    apro_match('Berlin', lst)
    print 
    print "apro_match('.*Berlin', lst)"
    print "="*27
    apro_match('.*Berlin', lst)

if __name__ == '__main__':
    main()

'e<=1' means that at most one error of any kind is permitted. There are three types of errors:

Insertion, indicated by "i"
Deletion, indicated by "d"
Substitution, indicated by "s"

Output

apro_match('Berlin', lst)
=========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(15, 21) Bärlin
(0, 6) äerlin
(0, 6) Beälin

apro_match('.*Berlin', lst)
===========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(0, 21) London, Paris, Bärlin
(0, 6) äerlin
(0, 6) Beälin
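
The three error types listed above (insertion, deletion, substitution) are the operations counted by a plain edit distance. As a rough illustration only, and not how the regex module implements fuzzy matching, here is a minimal Levenshtein sketch. The function name edit_distance is made up for this example, and it compares two whole sequences rather than searching inside a longer string. It also shows why matching on bytes instead of characters inflates the error count for non-ASCII input:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b (Levenshtein distance)."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j in range(1, len(b) + 1):
            substitution = previous[j - 1] + (a[i - 1] != b[j - 1])
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Character level: one substitution (e -> a-umlaut).
chars = edit_distance(u'Berlin', u'B\xe4rlin')

# Byte level: the a-umlaut is two bytes in UTF-8, so the same typo
# costs one substitution plus one insertion.
raw = edit_distance(bytearray(u'Berlin'.encode('utf-8')),
                    bytearray(u'B\xe4rlin'.encode('utf-8')))
```

Here chars is 1 and raw is 2, which is consistent with a byte-oriented matcher reporting different ranges and costs than a Unicode-aware one.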

answered Oct 10 '22 00:10

jfs

Internally TRE works at the byte level and returns byte positions. I had the same issue a while ago; there is no trick!

I modified the Python bindings, adding a utf8 function, a function that builds a map from byte position to character position, and a small wrapper. Your test case works as expected when using this wrapper. I have not released the modifications, since it was more of a quick hack while testing TRE; if you want them, just let me know.
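
Since those modifications are unreleased, the following is only a guess at what a byte-to-character position map could look like, not the actual wrapper code. The names byte_to_char_map and char_span are hypothetical, and the sketch assumes the input is valid UTF-8. It only converts offsets; it does not change how a byte-level matcher counts errors.

```python
def byte_to_char_map(encoded):
    """Map every byte offset of a UTF-8 byte string to a character offset.

    Entry i is the index of the character that byte i belongs to. One extra
    entry at the end maps the offset just past the last byte.
    """
    mapping = []
    char_index = -1
    for byte in bytearray(encoded):   # bytearray yields ints on Python 2 and 3
        if byte & 0xC0 != 0x80:       # not a UTF-8 continuation byte,
            char_index += 1           # so a new character starts here
        mapping.append(char_index)
    mapping.append(char_index + 1)    # end-of-string offset
    return mapping


def char_span(encoded, byte_span):
    """Convert a (start, end) byte span into a (start, end) character span."""
    mapping = byte_to_char_map(encoded)
    start, end = byte_span
    return mapping[start], mapping[end]
```

For example, char_span(u'B\xe4rlin'.encode('utf-8'), (0, 7)) gives (0, 6), turning the byte range from the question's output into the character range one would expect for the 6-character string.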

AFAIK TRE hasn't been updated for quite a while, and there are still unfixed bugs in the current release (0.8.0) relating to pattern matching towards the end of a string (e.g. searching "2004 " with the pattern "2004$" gives a cost of 2, while the expected cost is 1).

As others have pointed out, for Python the new regex module seems quite interesting!

answered Oct 10 '22 02:10

j-a

Tags: python, regex, fuzzy-comparison, tre-library