Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I iterate through unicode symbols, not bytes in python?

Given the accented unicode word like u'кни́га', I need to strip the acute (u'книга'), and also change the accent format to u'кни+га', where '+' represents the acute over the preceding letter.

What I do now is using a dictionary of acuted and not acuted symbols:

accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е',  u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))

I want to do something like this:

def changeAccentFormat(word):
  for letter in accent_dict:
    if letter in word:
      its_index = word.index(letter)
      word = word[:its_index + 1] + u'+' + word[its_index + 1:]
  return word

But of course it does not work as desired. I noticed that this code:

>>> word = u'кни́га'
>>> for letter in word:
...     print letter

gives

к
н
и                                                                                                                                                                                  
´   

г
а

(Well, i didn't expect the blank symbol to appear, but nevertheless). So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']? Or maybe there is some way to solve my problem without it?

like image 759
FrauHahnhen Avatar asked Dec 26 '13 12:12

FrauHahnhen


3 Answers

First of all, in regard to iterating over characters instead bytes, you're already doing it right - your word is an unicode object, not an encoded bytestring.

Now, for combination characters in Unicode:

For many characters containing combination characters there is a composed and decomposed form of writing it, the composed being one code point, and the decomposed a sequence of two (or more?) code points:

Composed and decomposed form of c with cedilla

See U+00E7, U+0063 and U+0327

So in Python, you could either write either form, it will get composed at display time to the same character:

>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç

In order to convert between composed and decomposed forms, you can use unicodedata.normalize():

>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç

(NFC stands for "normal form C" (composed), and NFD for "normal form D" (decomposed).

They still are different forms though - one consisting of one code point, the other of two:

>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2

However, in your case, there simply does not seem to be a combined character for the lowercase и with an accent acute (there is one for и with an accent grave)

like image 175
Lukas Graf Avatar answered Sep 23 '22 03:09

Lukas Graf


You can produce [u'к', u'н', u'и́', u'г', u'а'] with the regex module.

Here is the word you have by each user perceived character:

>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5
like image 41
dawg Avatar answered Sep 24 '22 03:09

dawg


Acutes are represented by codepoint 301, COMBINING ACUTE ACCENT, so a simple string character replacement should suffice:

>>>print u'кни́га'.replace(u'\u0301', "+")
кни+га

If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize should do the trick

like image 45
loopbackbee Avatar answered Sep 24 '22 03:09

loopbackbee