I have an application implementing incremental search. I have a catalog of unicode strings that I match against a given "key" string; a catalog string is a "hit" if it contains all of the characters of the key, in order, and it ranks better if the key characters cluster closely in the catalog string.
Anyway, this works fine and matches unicode exactly, so that "öst" will match "Östblocket" or "röst" or "röd sten".
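(The actual matcher is not part of the question, but a minimal sketch of that kind of in-order match, with a crude 1/(matched span) clustering rank, could look like the following; the rank function and the regex approach are just illustrative assumptions, not my real code.)

    import re

    def rank(key, candidate):
        # Build a lazy pattern such as u"ö.*?s.*?t" that looks for the key
        # characters in order, with as little text in between as possible.
        pattern = u".*?".join(re.escape(c) for c in key)
        m = re.search(pattern, candidate, re.UNICODE | re.IGNORECASE)
        if m is None:
            return None          # not a hit
        # A shorter matched span means the key characters cluster tighter.
        return 1.0 / (1 + m.end() - m.start())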
Now I want to implement folding, since there are cases where it is not useful to distinguish between a catalog character such as "á" or "é" and the key character "a" or "e".
For example: "Ole" should match "Olé"
How do I best implement this unicode-folding matcher in Python? Efficiency is important since I have to match thousands of catalog strings to the short, given key.
The folding does not have to turn the string into ASCII; in fact, the folded output can itself be unicode. Leaving a character in is better than stripping it.
I don't know which answer to accept, since I use a bit of both. Taking the NFKD decomposition and removing combining marks goes almost all the way; I only add some custom transliterations on top of that. Here is the module as it looks now (warning: it contains unicode chars inline, since that is much nicer to edit):
# -*- encoding: UTF-8 -*-

import unicodedata
from unicodedata import normalize, category


def _folditems():
    _folding_table = {
        # general non-decomposing characters
        # FIXME: This is not complete
        u"ł" : u"l",
        u"œ" : u"oe",
        u"ð" : u"d",
        u"þ" : u"th",
        u"ß" : u"ss",
        # germano-scandinavic canonical transliterations
        u"ü" : u"ue",
        u"å" : u"aa",
        u"ä" : u"ae",
        u"æ" : u"ae",
        u"ö" : u"oe",
        u"ø" : u"oe",
    }
    for c, rep in _folding_table.iteritems():
        yield (ord(c.upper()), rep.title())
        yield (ord(c), rep)

folding_table = dict(_folditems())


def tofolded(ustr):
    u"""Fold @ustr

    Return a unicode str where composed characters are replaced by
    their base, and extended latin characters are replaced by
    similar basic latin characters.

    >>> tofolded(u"Wyłącz")
    u'Wylacz'
    >>> tofolded(u"naïveté")
    u'naivete'

    Characters from other scripts are not transliterated.

    >>> tofolded(u"Ἑλλάς") == u'Ελλας'
    True

    (These doctests pass, but should they fail, they fail hard.)
    """
    srcstr = normalize("NFKD", ustr.translate(folding_table))
    return u"".join(c for c in srcstr if category(c) != 'Mn')


if __name__ == '__main__':
    import doctest
    doctest.testmod()
(And, for the actual matching, if that interests anyone: I construct folded strings for my whole catalog beforehand, and put the folded versions into the catalog objects' already-available alias property.)
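Roughly like this; CatalogItem and its aliases attribute are hypothetical stand-ins for my real catalog objects:

    class CatalogItem(object):
        def __init__(self, name):
            self.name = name
            self.aliases = set()    # extra strings the matcher also searches

    def add_folded_aliases(catalog):
        # Precompute the folded form once per item, so each incoming key
        # only pays the folding cost for itself, not for the whole catalog.
        for item in catalog:
            folded = tofolded(item.name)
            if folded != item.name:
                item.aliases.add(folded)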
To allow working with Unicode characters, Python 2 has a separate unicode type, which is a sequence of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string of 20 characters.
Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code; Python's Unicode string type is what lets your programs work with all of those characters.
In Python 2, the str type stores bytes and the unicode type stores code points. Plain string literals are str by default, which means bytes, and the default encoding is ASCII.
Unicode, on the other hand, defines far more characters than fit in a single byte, so you have to keep the distinction between characters and bytes in mind: a standard Python 2 string is really a byte string, and a "character" in it is really a byte.
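A short Python 2 interactive session makes the character/byte distinction concrete (the byte counts assume a UTF-8 encoding):

    >>> ustring = u'A unicode \u018e string \xf1'
    >>> len(ustring)                 # 20 code points
    20
    >>> bstring = ustring.encode('utf-8')
    >>> type(bstring)                # the encoded result is a byte string
    <type 'str'>
    >>> len(bstring)                 # 22 bytes: the two non-ASCII characters take 2 bytes each
    22
    >>> bstring.decode('utf-8') == ustring
    True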
You can use this strip_accents function to remove the accents:

    import unicodedata

    def strip_accents(s):
        # Decompose, then drop the combining marks (category 'Mn').
        return ''.join(c for c in unicodedata.normalize('NFD', unicode(s))
                       if unicodedata.category(c) != 'Mn')

    >>> strip_accents(u'Östblocket')
    u'Ostblocket'
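Note that decomposition only helps for characters that actually decompose; a letter like "ł" or "ø" has no combining-mark decomposition and passes through unchanged, which is why the module in the question adds its own transliteration table on top. A quick check:

    >>> strip_accents(u'naïveté')
    u'naivete'
    >>> strip_accents(u'Wyłącz')     # ł does not decompose, so it is left in place
    u'Wy\u0142acz'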