I am writing a series of tests for a palindrome solver. I came across this interesting palindrome in Hebrew:
טעם לפת תפל מעט
It is a palindrome, but the letter Mem has both a regular form (מ) and a "final form" (ם), which is how it appears as the last letter of a word. Short of hardcoding "0x5de => 0x5dd" in my program, I could not find a way to rely on Unicode, Python, or a library to treat the two as the same. Things I tried:
s = 'טעם לפת תפל מעט'
s.casefold() # Python 3.4
s.lower()
s.upper()
import unicodedata
unicodedata.normalize(...) # In case this functioned like a German Eszett
All yielded the same string. Other Hebrew letters that would cause this problem (in case someone searches for this later) would be Kaf, Nun, Peh, and Tsadeh. No, I am not a native speaker of Hebrew.
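For reference, the hardcoded mapping the question alludes to could look something like the sketch below. The table name `FINAL_TO_REGULAR` is made up for illustration, and it covers only the five final letters listed above (Kaf, Mem, Nun, Peh, Tsadeh), mapping each final form to its regular form.

```python
# Hypothetical explicit table: final form -> regular form.
# Only the five Hebrew letters that have final forms are listed.
FINAL_TO_REGULAR = str.maketrans({
    "\u05da": "\u05db",  # FINAL KAF   -> KAF
    "\u05dd": "\u05de",  # FINAL MEM   -> MEM
    "\u05df": "\u05e0",  # FINAL NUN   -> NUN
    "\u05e3": "\u05e4",  # FINAL PE    -> PE
    "\u05e5": "\u05e6",  # FINAL TSADI -> TSADI
})

s = 'טעם לפת תפל מעט'
folded = s.translate(FINAL_TO_REGULAR)
print(folded == folded[::-1])  # True
```

This works, but it is exactly the kind of hand-maintained table the question hopes to avoid.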
You can write a slightly more "rigorous" answer (one that's less likely to give false positives and false negatives) with a little more work. Note that Patrick Collins's answer could fail by matching lots of unrelated characters, because they share the last word of their Unicode character name.
One thing you can do is take a stricter approach to converting final letters:
import unicodedata
# Note the added accents
phrase = 'טעם̀ לפת תפל מ̀עט'
def convert_final_characters(phrase):
    for character in phrase:
        try:
            name = unicodedata.name(character)
        except ValueError:
            yield character
            continue
        if "HEBREW" in name and " FINAL" in name:
            try:
                yield unicodedata.lookup(name.replace(" FINAL", ""))
            except KeyError:
                # Fails for HEBREW LETTER WIDE FINAL MEM "ﬦ",
                # which has no non-final counterpart
                #
                # No failure if you first normalize to
                # HEBREW LETTER FINAL MEM "ם"
                yield character
        else:
            yield character
phrase = "".join(convert_final_characters(phrase))
phrase
#>>> 'טעמ̀ לפת תפל מ̀עט'
This just looks for Hebrew characters where "FINAL" can be removed, and does that.
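The comment in the code above hints that normalizing first avoids the failure for the wide final Mem. A minimal sketch of that idea follows, assuming NFKC normalization is acceptable for your input (it also rewrites other compatibility characters, not only Hebrew ones). The function name `fold_hebrew_finals` is hypothetical.

```python
import unicodedata

def fold_hebrew_finals(text):
    """Map Hebrew final-form letters to their regular forms (sketch)."""
    # Assumption: NFKC is acceptable here. It turns presentation forms
    # such as U+FB26 (WIDE FINAL MEM) into U+05DD (FINAL MEM) first.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        # name() with a default avoids ValueError for unnamed characters
        name = unicodedata.name(ch, "")
        if name.startswith("HEBREW LETTER FINAL "):
            try:
                ch = unicodedata.lookup(name.replace(" FINAL", "", 1))
            except KeyError:
                pass  # no regular counterpart: keep the character
        out.append(ch)
    return "".join(out)

print(fold_hebrew_finals("\ufb26") == "\u05de")  # True
```

Matching on the `HEBREW LETTER FINAL ` prefix is a little stricter than the substring test above, at the cost of relying on normalization to bring presentation forms into that shape.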
You can then also convert to graphemes using the "new" regex module on PyPI.
import regex
# "\X" matches graphemes
graphemes = regex.findall(r"\X", phrase)
graphemes
#>>> ['ט', 'ע', 'מ̀', ' ', 'ל', 'פ', 'ת', ' ', 'ת', 'פ', 'ל', ' ', 'מ̀', 'ע', 'ט']
graphemes == graphemes[::-1]
#>>> True
This deals with accents and other combining characters.
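If installing a third-party module is not an option, a rough standard-library approximation is to attach each combining mark to the character before it. The sketch below is not a full implementation of Unicode grapheme clusters (it ignores ZWJ sequences, Hangul jamo, and similar cases), and `simple_clusters` is a made-up helper name. It does show why reversing code points directly goes wrong once accents are involved.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks after it.

    Approximation only: not the full grapheme rules that \\X follows.
    """
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me are combining marks
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "a\u0300ba\u0300"  # "aba" with a combining grave on each "a"
print(word == word[::-1])          # False: the accents change position
clusters = simple_clusters(word)
print(clusters == clusters[::-1])  # True
```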