
Changing the case of letters in a Unicode string containing accented and local letters

Python string and unicode objects have the following methods for case conversion:

  • upper()
  • lower()
  • title()

Using unicode strings, I can handle nearly all characters in my local alphabet:

test_str = u"ças şak ürt örkl"
print test_str.upper()
>> ÇAS ŞAK ÜRT ÖRKL

Except for two letters. Since I live in Turkey, I have the typical Turkish I problem.

In my local alphabet, we have a letter İ, which is similar to I, and their case conversions must be as follows:

I → lowercase → ı

i → uppercase → İ

And yes, this breaks the ASCII conversion i --> I, because in Turkish i and I are two separate letters.

test_str = u"ik"
print test_str.upper()
>> IK  # Wrong! must be İK
test_str = u"IK"
print test_str.lower()
>> ik  # Wrong! must be ık

How can I overcome this? Is there a way to handle these case conversions correctly using Python built-ins?

FallenAngel asked Mar 05 '14 13:03



1 Answer

Python currently doesn't have any support for locale-specific case folding, or the other rules in Unicode SpecialCasing.txt. If you need them today, you can get them from PyICU.

>>> import icu  # provided by the PyICU package
>>> unicode( icu.UnicodeString(u'IK').toLower(icu.Locale('TR')) )
u'ık'

Although if all you care about is the Turkish I, you might prefer to just special-case it.
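One way to special-case it, as a minimal sketch: replace the two Turkish letter pairs by hand before calling the default methods. The names turkish_upper and turkish_lower are hypothetical helpers, not part of Python or PyICU, and the sketch assumes the whole string should follow Turkish rules.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers: handle the Turkish dotted/dotless I pairs explicitly,
# then let the default case mapping deal with every other character.

def turkish_upper(text):
    """Uppercase a unicode string using Turkish rules for i and ı."""
    # i -> İ (U+0130) has to be done by hand. ı -> I is already what the
    # default mapping does, so upper() covers it.
    return text.replace(u"i", u"\u0130").upper()


def turkish_lower(text):
    """Lowercase a unicode string using Turkish rules for I and İ."""
    # I -> ı (U+0131) and İ -> i are both replaced up front, so the default
    # mapping never sees an uppercase I or İ.
    return text.replace(u"I", u"\u0131").replace(u"\u0130", u"i").lower()
```

With these helpers, turkish_upper(u"ik") gives u"İK" and turkish_lower(u"IK") gives u"ık". Text that mixes Turkish with other languages would get the Turkish mapping everywhere, so this only fits input known to be Turkish.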

bobince answered Sep 18 '22 17:09