
How can I detect a palindrome in Hebrew?

I am writing a series of tests for a palindrome solver. I came across this interesting palindrome in Hebrew:

טעם לפת תפל מעט

This is a palindrome, but the letter Mem has both a regular form (מ) and a "final form" (ם), which is how it appears as the last letter of a word. Short of hardcoding "0x5de => 0x5dd" in my program, I could not find a way to rely on Unicode, Python, or a library to treat the two as the same. Things I tried:

s = 'טעם לפת תפל מעט'
s.casefold() # Python 3.4
s.lower()
s.upper()
import unicodedata
unicodedata.normalize(...) # In case this functioned like a German Eszett

All yielded the same string. Other Hebrew letters that would cause this problem (in case someone searches for this later) would be Kaf, Nun, Peh, and Tsadeh. No, I am not a native speaker of Hebrew.

heptadecagram asked Jun 20 '14 14:06


1 Answer

With a little more work, you can get a slightly more "rigorous" answer, one that is less likely to give false positives and false negatives. Note that Patrick Collin's answer could fail by matching lots of unrelated characters, because they share the last word of their Unicode character names.

One thing you can do is take a stricter approach to converting final letters:

import unicodedata

# Note the added accents
phrase = 'טעם̀ לפת תפל מ̀עט'

def convert_final_characters(phrase):
    for character in phrase:
        try:
            name = unicodedata.name(character)
        except ValueError:
            yield character
            continue

        if "HEBREW" in name and " FINAL" in name:
            try:
                yield unicodedata.lookup(name.replace(" FINAL", ""))
            except KeyError:
                # Fails for HEBREW LETTER WIDE FINAL MEM "ﬦ",
                # which has no non-final counterpart
                #
                # No failure if you first normalize to
                # HEBREW LETTER FINAL MEM "ם"
                yield character
        else:
            yield character

phrase = "".join(convert_final_characters(phrase))
phrase
#>>> 'טעמ̀ לפת תפל מ̀עט'

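This looks for Hebrew characters whose Unicode name contains " FINAL" and replaces each one with the character named by the same string without " FINAL". If no such character exists, the original character is kept.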
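If you would rather not inspect names for every character of every phrase, the same rule can be applied once to build a translation table. This is only a sketch under some assumptions: `build_final_form_table`, `FINAL_FORMS` and `fold_final_forms` are names made up for illustration, the scan is limited to the main Hebrew block (U+0590 to U+05FF), and the choice of NFKC is an assumption about which normalization you want. NFKC folds presentation forms such as the wide final Mem into ordinary letters, but it also rewrites other compatibility characters.

```python
import unicodedata


def build_final_form_table():
    """Map each Hebrew final letter to its regular form, using Unicode names."""
    table = {}
    for codepoint in range(0x0590, 0x0600):  # main Hebrew block only
        # Unassigned code points have no name; fall back to an empty string.
        name = unicodedata.name(chr(codepoint), "")
        if name.startswith("HEBREW LETTER FINAL "):
            regular = unicodedata.lookup(name.replace(" FINAL", ""))
            table[codepoint] = ord(regular)
    return table


# Built once; str.translate accepts a dict of code point -> code point.
FINAL_FORMS = build_final_form_table()


def fold_final_forms(text):
    """Return text with Hebrew final letters replaced by their regular forms."""
    # Assumption: NFKC is acceptable here. It turns presentation forms such
    # as U+FB26 (wide final Mem) into U+05DD before the table is applied.
    return unicodedata.normalize("NFKC", text).translate(FINAL_FORMS)
```

The table ends up with five entries, one each for Kaf, Mem, Nun, Pe and Tsadi, which matches the list of letters in the question.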


You can then split the phrase into graphemes using the "new" regex module from PyPI (a third-party package, not the standard library's re).

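import regex

# r"\X" matches one grapheme (a base character plus its combining marks);
# the raw string avoids Python's invalid-escape warning for "\X"
graphemes = regex.findall(r"\X", phrase)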
graphemes
#>>> ['ט', 'ע', 'מ̀', ' ', 'ל', 'פ', 'ת', ' ', 'ת', 'פ', 'ל', ' ', 'מ̀', 'ע', 'ט']

graphemes == graphemes[::-1]
#>>> True

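This deals with accents and other combining characters: because whole graphemes are compared, each mark stays attached to its base letter when the sequence is reversed.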
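If installing regex is not an option, a rough stand-in can be written with the standard library alone. This is a sketch, not an equivalent of `\X`: it only attaches characters with a non-zero canonical combining class to the preceding character, so it will not handle everything a real grapheme segmenter does (emoji sequences, for example). The names `clusters` and `is_palindrome` are made up for illustration, and ignoring whitespace is an assumption about what should count as a palindrome. The phrase is the one from above, with the final letters already converted.

```python
import unicodedata


def clusters(text):
    """Group each character with the combining marks that follow it."""
    groups = []
    for character in text:
        # combining() returns 0 for base characters, non-zero for marks.
        if groups and unicodedata.combining(character):
            groups[-1] += character
        else:
            groups.append(character)
    return groups


def is_palindrome(text):
    """Compare clusters rather than code points, ignoring whitespace."""
    units = [unit for unit in clusters(text) if not unit.isspace()]
    return units == units[::-1]


phrase = 'טעמ̀ לפת תפל מ̀עט'
print(is_palindrome(phrase))
```

Reversing the raw string instead would move each accent onto the wrong letter, which is why the comparison is done on clusters.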

Veedrac answered Sep 26 '22 00:09