
Why is this Turkish character being corrupted when I lowercase it?

I am trying to convert some words that contain Turkish characters to lowercase.

I'm reading the words from a UTF-8 encoded file:

with open(filepath, 'r', encoding='utf8') as f:
    text = f.read().lower()

When I convert to lowercase, the Turkish character İ gets corrupted. However, when I convert to uppercase, it works fine.

Here is example code:

# 'word' rather than 'str', so the built-in str type is not shadowed
word = 'İşbirliği'
print(word)
print(word.lower())

Here is how it looks when it is corrupted:

[Screenshot of the console output; image not preserved]

What's going on here?

Some info that might be useful:

  • I'm using Windows 10 cmd prompt
  • Python version 3.6.0
  • chcp is set to 65001

asked Feb 04 '23 20:02 by moth


1 Answer

It's not corrupted.

Turkish has both a dotted lowercase i and a dotless lowercase ı, and similarly a dotted uppercase İ and a dotless uppercase I.

This presents a challenge when converting the dotted uppercase İ to lowercase: how to retain the information that, if it needs to be converted back to uppercase, it should be converted back to the dotted İ?

Unicode solves this problem as follows: when İ is converted to lowercase, it's actually converted to the standard Latin i plus the combining character U+0307 "COMBINING DOT ABOVE". What you're seeing is your terminal's inability to properly render (or, more to the point, refrain from rendering) the combining character; it has nothing to do with Python.

You can see that this is happening using unicodedata.name():

>>> import unicodedata
>>> [unicodedata.name(c) for c in 'İ']
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
>>> [unicodedata.name(c) for c in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

... although, in a working and correctly configured terminal, it will render without any problems:

>>> 'İ'.lower()
'i̇'
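If the terminal can't be trusted to render the result, you can inspect the string without relying on rendering at all. This is only a small sketch; the variable names `lowered` and `code_points` are made up for the example:

```python
lowered = 'İ'.lower()

# len() counts code points, so the single visible letter reports as 2.
print(len(lowered))

# ascii() escapes every non-ASCII character, so the output looks the
# same in any terminal, whatever its font or code page.
print(ascii(lowered))

# The code points in the U+XXXX notation used above.
code_points = ['U+{:04X}'.format(ord(c)) for c in lowered]
print(code_points)
```

The second code point, U+0307, is the combining dot that the terminal is drawing badly.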

As a side note, if you do convert it back to uppercase, it will remain in the decomposed form:

>>> [unicodedata.name(c) for c in 'İ'.lower().upper()]
['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']

… although you can recombine it with unicodedata.normalize():

>>> [unicodedata.name(c) for c in unicodedata.normalize('NFC','İ'.lower().upper())]
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
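If what you actually want is Turkish-style lowercasing (İ to i, I to ı) with no combining character left over, note that str.lower() takes no locale argument, so one possible workaround is to map the two special letters yourself before calling it. This is a rough sketch, not a standard-library feature: turkish_lower is a hypothetical helper, and it assumes the text really is Turkish, since it would mangle an English word like "IT".

```python
# Assumption: the input is known to be Turkish. For other languages,
# 'I' should lowercase to 'i', so this mapping would be wrong.
_TURKISH_UPPER_I = str.maketrans({'İ': 'i', 'I': 'ı'})


def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless i pairs (hypothetical helper)."""
    # Handle the two special letters first, then let str.lower() do the rest.
    return text.translate(_TURKISH_UPPER_I).lower()


word = 'İşbirliği'
print(len(word.lower()))          # 10: 'i' + U+0307 replaces the single 'İ'
print(len(turkish_lower(word)))   # 9: no combining character is produced
print(turkish_lower('IŞIK') == 'ışık')
```

The trade-off is that the result no longer round-trips through plain upper(): 'i'.upper() gives the dotless I, so going back to uppercase would need the reverse mapping.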

For more information, see:

  • Internationalization for Turkish: Dotted and Dotless Letter "I"
  • What's Wrong With Turkey?

answered Feb 07 '23 18:02 by Zero Piraeus