
Is there a way to compare Arabic characters without regard to their initial/medial/final form?

In Latin script, letters have an upper case and a lower case form. In Python, if you want to compare two strings without regard to their case, you can convert them to the same case using 'string'.upper() or 'string'.lower().

In Arabic script, letters can have an initial, medial, or final form. Is there a similar way to compare strings of Arabic characters without caring which form the letters are in?

drs asked May 05 '15 01:05


1 Answer

There are two parts to this, which should work for all languages:*

  • Your strings must be normalized to NFKD, so that any two equivalent strings are made up of the same code points.
  • To ignore case in comparing two NFKD strings, use the Unicode case-folding algorithm.

Between the two, this handles English upper and lower case, Arabic initial/medial/final (plus isolated), German ß vs. ss, é as a single code point vs. e\N{COMBINING ACUTE ACCENT}, Chinese rotated characters, Japanese half-width kana, and probably all kinds of other things you haven't thought of.

In Python, that looks like this:

>>> import unicodedata
>>> s1 = 'ﻧ'
>>> s2 = 'ﻨ'
>>> unicodedata.normalize('NFKD', s1).casefold() == unicodedata.normalize('NFKD', s2).casefold()
True

Note that casefold wasn't added until Python 3.3. If you're using an earlier version of Python, there are implementations on PyPI; using them should be similar to using the 3.3+ builtin.
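
As a rough sketch, the two steps can be wrapped in a small helper and tried on a few of the cases listed above. The names fold and same_text are made up for this example, and it assumes Python 3.3+:

```python
import unicodedata

def fold(text):
    """Build a comparison key: NFKD-normalize, then case-fold."""
    return unicodedata.normalize('NFKD', text).casefold()

def same_text(a, b):
    """Compare two strings by their folded keys (hypothetical helper)."""
    return fold(a) == fold(b)

# Arabic noon: initial form (U+FEE7) vs. medial form (U+FEE8)
print(same_text('\uFEE7', '\uFEE8'))
# German sharp s vs. "SS"
print(same_text('Straße', 'STRASSE'))
# Precomposed e-acute vs. e + COMBINING ACUTE ACCENT
print(same_text('\u00E9', 'e\u0301'))
# Half-width katakana KA (U+FF76) vs. regular katakana KA (U+30AB)
print(same_text('\uFF76', '\u30AB'))
```

Each of these comparisons should come out equal, even though a plain == on the raw strings does not. Strictly speaking, case folding can occasionally produce a string that is no longer normalized, which is why the Unicode HOWTO in the Python docs normalizes once more after calling casefold(); for the cases above the simpler version is enough.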


If you're interested in exactly how this works for Arabic, rather than just the fact that it works for Arabic along with every other language, you have to read the algorithms and tables at unicode.org. IIRC, the W3C document that recommends doing this explains why it works using Arabic as an example. I believe it's because Unicode treats the initial, medial, final, and isolated forms as compatibility-equivalent presentation forms of the same character, so compatibility decomposition (the K in NFKD) maps each of them to the same base letter. Casefolding a presentation form directly just returns the character itself, so for Arabic it is the normalization step that does the work.
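
If you want to poke at this yourself, the standard unicodedata module exposes the relevant table entries. The snippet below is only an illustration using the four presentation forms of the letter noon (U+FEE5 through U+FEE8); describe is a made-up helper name:

```python
import unicodedata

def describe(ch):
    """Collect what the Unicode database says about one character."""
    return {
        'name': unicodedata.name(ch),
        # Compatibility mapping as stored in the database, tag included
        'decomposition': unicodedata.decomposition(ch),
        # Canonical normalization leaves presentation forms alone
        'nfd': unicodedata.normalize('NFD', ch),
        # Compatibility normalization maps them to the base letter
        'nfkd': unicodedata.normalize('NFKD', ch),
        # Case folding on its own does not touch them
        'casefold_changes': ch.casefold() != ch,
    }

for ch in '\uFEE5\uFEE6\uFEE7\uFEE8':
    info = describe(ch)
    print('U+%04X' % ord(ch), info['name'], '|', info['decomposition'],
          '| NFKD -> U+%04X' % ord(info['nfkd']))
```

The decomposition field carries a tag such as <initial> or <medial> followed by 0646, the code point of the plain letter noon. All four forms end up as U+0646 under NFKD, while NFD and casefold() leave them as they are.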


* There are a few cases where two different languages or cultures use the same script, but have different case-folding rules; in that case, you need locale-specific casefolding, which Python doesn't include. But that shouldn't be relevant here.
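
The usual example of that is Turkish, where the lower-case form of I is the dotless ı (U+0131) and the upper-case form of i is the dotted İ (U+0130). A minimal sketch of what a locale-aware fold might look like, with turkic_casefold as a hypothetical helper that is not part of Python or any library:

```python
def turkic_casefold(text):
    """Hypothetical Turkish/Azeri-aware fold: apply the two Turkic
    mappings first, then fall back to the default case folding."""
    text = text.replace('\u0130', 'i')   # dotted capital I -> i
    text = text.replace('I', '\u0131')   # capital I -> dotless i
    return text.casefold()

upper = 'ISIK'
lower = '\u0131s\u0131k'   # the same letters in lower case, with dotless i

# Default folding turns 'I' into 'i', so the two do not match
print(upper.casefold() == lower.casefold())
# With the Turkic mappings applied first, they do
print(turkic_casefold(upper) == turkic_casefold(lower))
```

This only covers the two I mappings and assumes you already know the text is Turkish; it is meant to show the shape of the problem, not to be a complete solution.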

abarnert answered Sep 21 '22 09:09