I have an application implementing incremental search. I have a catalog of unicode strings that I match against a given "key" string; a catalog string is a "hit" if it contains all of the characters of the key, in order, and it ranks better if the key characters cluster closely in the catalog string.
Anyway, this works fine and matches unicode exactly, so that "öst" will match "Östblocket" or "röst" or "röd sten".
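(The actual matcher is not part of the question, but a minimal sketch of that kind of in-order match, with a crude 1/(matched span) clustering rank, could look like the following; the rank function and the regex approach are just illustrative assumptions, not my real code.)

    import re

    def rank(key, candidate):
        # Build a lazy pattern such as u"ö.*?s.*?t" that looks for the key
        # characters in order, with as little text in between as possible.
        pattern = u".*?".join(re.escape(c) for c in key)
        m = re.search(pattern, candidate, re.UNICODE | re.IGNORECASE)
        if m is None:
            return None          # not a hit
        # A shorter matched span means the key characters cluster tighter.
        return 1.0 / (1 + m.end() - m.start())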
Now I want to implement folding, since there are cases where it is not useful to distinguish between a catalog character such as "á" or "é" and the key character "a" or "e".
For example: "Ole" should match "Olé"
How do I best implement this unicode-folding matcher in Python? Efficiency is important since I have to match thousands of catalog strings to the short, given key.
The folding does not have to turn the string into ASCII; in fact, the folded output can itself be unicode. Leaving a character in is better than stripping it.
I don't know which answer to accept, since I use a bit of both. Taking the NFKD decomposition and removing combining marks goes almost all the way; I only add some custom transliterations on top of that. Here is the module as it looks now (warning: it contains unicode chars inline, since that is much nicer to edit):
# -*- encoding: UTF-8 -*-

import unicodedata
from unicodedata import normalize, category


def _folditems():
    _folding_table = {
        # general non-decomposing characters
        # FIXME: This is not complete
        u"ł" : u"l",
        u"œ" : u"oe",
        u"ð" : u"d",
        u"þ" : u"th",
        u"ß" : u"ss",
        # germano-scandinavic canonical transliterations
        u"ü" : u"ue",
        u"å" : u"aa",
        u"ä" : u"ae",
        u"æ" : u"ae",
        u"ö" : u"oe",
        u"ø" : u"oe",
    }
    for c, rep in _folding_table.iteritems():
        yield (ord(c.upper()), rep.title())
        yield (ord(c), rep)

folding_table = dict(_folditems())


def tofolded(ustr):
    u"""Fold @ustr

    Return a unicode str where composed characters are replaced by
    their base, and extended latin characters are replaced by
    similar basic latin characters.

    >>> tofolded(u"Wyłącz")
    u'Wylacz'
    >>> tofolded(u"naïveté")
    u'naivete'

    Characters from other scripts are not transliterated.

    >>> tofolded(u"Ἑλλάς") == u'Ελλας'
    True

    (These doctests pass, but should they fail, they fail hard.)
    """
    srcstr = normalize("NFKD", ustr.translate(folding_table))
    return u"".join(c for c in srcstr if category(c) != 'Mn')


if __name__ == '__main__':
    import doctest
    doctest.testmod()
(And, for the actual matching, if that interests anyone: I construct folded strings for my whole catalog beforehand, and put the folded versions into the catalog objects' already-available alias property.)
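Roughly like this; CatalogItem and its aliases attribute are hypothetical stand-ins for my real catalog objects:

    class CatalogItem(object):
        def __init__(self, name):
            self.name = name
            self.aliases = set()    # extra strings the matcher also searches

    def add_folded_aliases(catalog):
        # Precompute the folded form once per item, so each incoming key
        # only pays the folding cost for itself, not for the whole catalog.
        for item in catalog:
            folded = tofolded(item.name)
            if folded != item.name:
                item.aliases.add(folded)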
To allow working with Unicode characters, Python 2 has a separate unicode type, which is a sequence of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string of 20 characters.
Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code; Python's Unicode string type is what lets your programs work with all of those characters.
In Python 2, the str type stores bytes and the unicode type stores code points. Plain string literals are str by default, which means bytes, and the default encoding is ASCII.
Unicode, on the other hand, defines far more characters than fit in a single byte, so you have to keep the distinction between characters and bytes in mind: a standard Python 2 string is really a byte string, and a "character" in it is really a byte.
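A short Python 2 interactive session makes the character/byte distinction concrete (the byte counts assume a UTF-8 encoding):

    >>> ustring = u'A unicode \u018e string \xf1'
    >>> len(ustring)                 # 20 code points
    20
    >>> bstring = ustring.encode('utf-8')
    >>> type(bstring)                # the encoded result is a byte string
    <type 'str'>
    >>> len(bstring)                 # 22 bytes: the two non-ASCII characters take 2 bytes each
    22
    >>> bstring.decode('utf-8') == ustring
    True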
You can use this strip_accents function to remove the accents:

    import unicodedata

    def strip_accents(s):
        # Decompose, then drop the combining marks (category 'Mn').
        return ''.join(c for c in unicodedata.normalize('NFD', unicode(s))
                       if unicodedata.category(c) != 'Mn')

    >>> strip_accents(u'Östblocket')
    u'Ostblocket'
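Note that decomposition only helps for characters that actually decompose; a letter like "ł" or "ø" has no combining-mark decomposition and passes through unchanged, which is why the module in the question adds its own transliteration table on top. A quick check:

    >>> strip_accents(u'naïveté')
    u'naivete'
    >>> strip_accents(u'Wyłącz')     # ł does not decompose, so it is left in place
    u'Wy\u0142acz'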