Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement Unicode string matching by folding in python

Tags:

python

unicode

I have an application implementing incremental search. I have a catalog of unicode strings to be matched and match them to a given "key" string; a catalog string is a "hit" if it contains all of the characters in the key, in order, and it is ranked better if the key characters cluster in the catalog string.

Anyway, this works fine and matches unicode exactly, so that "öst" will match "Östblocket" or "röst" or "röd sten".

Anyway, now I want to implement folding, since there are some cases where it is not useful to distinguish between a catalog character such as "á" or "é" and the key character "a" or "e".

For example: "Ole" should match "Olé"

How do I best implement this unicode-folding matcher in Python? Efficiency is important since I have to match thousands of catalog strings to the short, given key.

It does not have to turn it into ascii; in fact, the algorithm's output string could be unicode. Leaving a character in is better than stripping it.


I don't know which answer to accept, since I use a bit of both. Taking the NKFD decomposition and removing combining marks goes almost the way to finish, I only add some custom transliterations to that. Here is the module as it looks now: (Warning, contains unicode chars inline, since it is much nicer to edit that way.)

# -*- encoding: UTF-8 -*-

import unicodedata
from unicodedata import normalize, category

def _folditems():
    _folding_table = {
        # general non-decomposing characters
        # FIXME: This is not complete
        u"ł" : u"l",
        u"œ" : u"oe",
        u"ð" : u"d",
        u"þ" : u"th",
        u"ß" : u"ss",
        # germano-scandinavic canonical transliterations
        u"ü" : u"ue",
        u"å" : u"aa",
        u"ä" : u"ae",
        u"æ" : u"ae",
        u"ö" : u"oe",
        u"ø" : u"oe",
    }

    for c, rep in _folding_table.iteritems():
        yield (ord(c.upper()), rep.title())
        yield (ord(c), rep)

folding_table = dict(_folditems())

def tofolded(ustr):
    u"""Fold @ustr

    Return a unicode str where composed characters are replaced by
    their base, and extended latin characters are replaced by
    similar basic latin characters.

    >>> tofolded(u"Wyłącz")
    u'Wylacz'
    >>> tofolded(u"naïveté")
    u'naivete'

    Characters from other scripts are not transliterated.

    >>> tofolded(u"Ἑλλάς") == u'Ελλας'
    True

    (These doctests pass, but should they fail, they fail hard)
    """
    srcstr = normalize("NFKD", ustr.translate(folding_table))
    return u"".join(c for c in srcstr if category(c) != 'Mn')

if __name__ == '__main__':
    import doctest
    doctest.testmod()

(And, for the actual matching if that interests anyone: I construct folded strings for all my catalog beforehand, and put the folded versions into the already-available catalog object alias property.)

like image 856
u0b34a0f6ae Avatar asked Sep 11 '09 11:09

u0b34a0f6ae


People also ask

How do you represent Unicode strings in Python?

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

Can Python string include Unicode?

Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.

Are Python strings ASCII or Unicode?

Python 2 uses str type to store bytes and unicode type to store unicode code points. All strings by default are str type — which is bytes~ And Default encoding is ASCII.

What is the difference between Unicode and string in Python?

Unicode, on the other hand, has tens of thousands of characters. That means that each Unicode character takes more than one byte, so you need to make the distinction between characters and bytes. Standard Python strings are really byte strings, and a Python character is really a byte.


1 Answers

You can use this strip_accents function to remove the accents:

def strip_accents(s):
   return ''.join((c for c in unicodedata.normalize('NFD', unicode(s)) if unicodedata.category(c) != 'Mn'))

>>> strip_accents(u'Östblocket')
'Ostblocket'
like image 197
João Silva Avatar answered Nov 12 '22 20:11

João Silva