Given the accented unicode word like u'кни́га'
, I need to strip the acute (u'книга'
), and also change the accent format to u'кни+га'
, where '+'
represents the acute over the preceding letter.
What I do now is using a dictionary of acuted and not acuted symbols:
accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е', u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))
I want to do something like this:
def changeAccentFormat(word):
for letter in accent_dict:
if letter in word:
its_index = word.index(letter)
word = word[:its_index + 1] + u'+' + word[its_index + 1:]
return word
But of course it does not work as desired. I noticed that this code:
>>> word = u'кни́га'
>>> for letter in word:
... print letter
gives
к
н
и
´
г
а
(Well, i didn't expect the blank symbol to appear, but nevertheless). So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']
? Or maybe there is some way to solve my problem without it?
First of all, in regard to iterating over characters instead bytes, you're already doing it right - your word
is an unicode object, not an encoded bytestring.
Now, for combination characters in Unicode:
For many characters containing combination characters there is a composed and decomposed form of writing it, the composed being one code point, and the decomposed a sequence of two (or more?) code points:
See U+00E7, U+0063 and U+0327
So in Python, you could either write either form, it will get composed at display time to the same character:
>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç
In order to convert between composed and decomposed forms, you can use unicodedata.normalize()
:
>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç
(NFC
stands for "normal form C" (composed), and NFD
for "normal form D" (decomposed).
They still are different forms though - one consisting of one code point, the other of two:
>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2
However, in your case, there simply does not seem to be a combined character for the lowercase и
with an accent acute (there is one for и
with an accent grave)
You can produce [u'к', u'н', u'и́', u'г', u'а']
with the regex module.
Here is the word you have by each user perceived character:
>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5
Acutes are represented by codepoint 301, COMBINING ACUTE ACCENT
, so a simple string character replacement should suffice:
>>>print u'кни́га'.replace(u'\u0301', "+")
кни+га
If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize should do the trick
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With