Python string and unicode objects have the following methods for string case conversion:
upper()
lower()
title()
Using unicode strings, I can handle nearly all characters in my local alphabet:
test_str = u"ças şak ürt örkl"
print test_str.upper()
>> ÇAS ŞAK ÜRT ÖRKL
Except for two letters. Since I live in Turkey, I have the typical Turkish I problem.
In my local alphabet, we have a letter İ, which is similar to I, and their case conversions must be as follows:
I → lowercase → ı
i → uppercase → İ
And yes, it spoils the ASCII conversion i --> I, since i and I are two separate letters.
test_str = u"ik"
print test_str.upper()
>> IK # Wrong! must be İK
test_str = u"IK"
print test_str.lower()
>> ik # Wrong! must be ık
How can I overcome this? Is there a way to handle case conversions correctly using Python built-ins?
The Unicode Standard is the specification of an encoding scheme for written characters and text. It is a universal standard that enables consistent encoding of multilingual text and allows text data to be interchanged internationally without conflict.
We can remove accents from a string by using a Python module called Unidecode. This module provides a function that takes a Unicode object or string and returns a string without accents.
In Java, use java.text.Normalizer to handle this for you. This will separate all of the accent marks from the characters.
Python currently doesn't have any support for locale-specific case folding, or the other rules in Unicode SpecialCasing.txt. If you need it today, you can get them from PyICU.
>>> import icu  # provided by the PyICU package
>>> unicode( icu.UnicodeString(u'IK').toLower(icu.Locale('TR')) )
u'ık'
Although if all you care about is the Turkish I, you might prefer to just special-case it.
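A minimal sketch of such special-casing, assuming the text uses the precomposed letters İ (U+0130) and ı (U+0131). The helper names turkish_upper and turkish_lower are made up for illustration and are not part of Python or PyICU. The idea is to swap the two Turkish-specific letters by hand first, then let the built-in upper() and lower() handle everything else:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Assumption: input uses the precomposed code points below.
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def turkish_upper(text):
    """Uppercase text with Turkish rules: i -> İ, ı -> I (hypothetical helper)."""
    # Replace the Turkish-specific letters before the generic conversion,
    # so the built-in upper() never gets to map i to I.
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkish_lower(text):
    """Lowercase text with Turkish rules: İ -> i, I -> ı (hypothetical helper)."""
    # İ must be handled first; otherwise the generic lower() may turn it
    # into something other than a plain i.
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()
```

With these helpers, turkish_upper(u"ik") gives İK and turkish_lower(u"IK") gives ık. Note that this applies the Turkish rules to every i and I in the string, so it is only appropriate for text known to be Turkish.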