Same character, different length and bytes [duplicate]

Question

When I download files from Korean websites, the filenames are often wrongly encoded/decoded and end up all jumbled. I found out that by encoding with 'iso-8859-1' and decoding with 'euc-kr', I can fix this problem. However, I now have a new problem where the same-looking character is, in fact, different. Check out the Python shell below:

>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>

Encoding the first string with 'iso-8859-1' is possible; encoding the second is not. So my questions are:

1. What is the difference between these two strings?
2. Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
3. And how can I fix this? (e.g. convert second_string to the likeness of first_string)

Thank you.

tremby · Accepted Answer

An easy way to find out exactly what a character is: ask vim. Put the cursor over the character and type ga to get info on it.

The first one is:
```
<â> 226, Hex 00e2, Octal 342
```
And the second:
```
<a>  97,  Hex 61,  Octal 141 < ̂> 770, Hex 0302, Octal 1402
```
1. In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.
2. Ask the website operators. How would we know?!
3. You need something which turns combining characters into regular characters. A Google search yielded this question, for example.
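
If vim is not at hand, the same inspection can be done from Python itself. The following is a minimal sketch for Python 3; the helper name `describe` is made up for illustration, while `unicodedata.name` is the standard library call that returns a code point's official Unicode name.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each code point in text."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


precomposed = "\u00e2"  # one code point: a with circumflex
decomposed = "a\u0302"  # two code points: 'a' + combining circumflex

for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, "length:", len(s))
    for ch, code_point, name in describe(s):
        print("   ", code_point, name)
```

The two strings render identically, but the first reports a single code point and the second reports two, which matches the vim output above.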

As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
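
Applied to the filename problem from the question, the normalization step would go before the 'iso-8859-1' encode. Below is a rough sketch for Python 3, assuming the garbled names really are EUC-KR bytes that were misread as ISO-8859-1; `repair_filename` is a hypothetical helper name, and the sample name is built in the code purely to simulate the situation.

```python
import unicodedata


def repair_filename(garbled):
    """Hypothetical helper: undo EUC-KR bytes that were misread as ISO-8859-1."""
    # Compose base letter + combining mark pairs first, so that every
    # character fits into ISO-8859-1 again.
    composed = unicodedata.normalize("NFC", garbled)
    # Recover the original bytes, then decode them as EUC-KR.
    return composed.encode("iso-8859-1").decode("euc-kr")


# Simulate a garbled name: EUC-KR bytes read as ISO-8859-1, then decomposed
# (NFD) the way some sources deliver accented characters.
original = "한글"
misread = original.encode("euc-kr").decode("iso-8859-1")
garbled = unicodedata.normalize("NFD", misread)

print(repair_filename(garbled))
```

Names that are already fully composed pass through `normalize("NFC", ...)` unchanged, so the same helper should work for both kinds of download.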

clemens · Answer

Unicode has different representations for characters with accents or diaereses. There is the single precomposed character at code point U+00E2, and there is the sequence that uses the COMBINING CIRCUMFLEX ACCENT (U+0302), which is created by u'a\u0302' in Python 2.7. The latter consists of two characters: the 'a' and the circumflex.
A possible reason for the different representations is that the creator of the website copied the texts from different sources. For example, PDF documents often represent umlauts and accent marks as a base letter plus a combining character, while typing these characters on a keyboard generally produces the single-character representation.
You can use unicodedata.normalize to convert the combining characters into single characters, e.g.:
```
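from __future__ import print_function  # same code runs on Python 2.7 and 3
from unicodedata import normalize

s = u'a\u0302'  # 'a' followed by COMBINING CIRCUMFLEX ACCENT
# NFC composes the pair into the single code point U+00E2
print(s, len(s), len(normalize("NFC", s)))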
```

will output â 2 1.

Tags:

python

character-encoding

unicode

Syphon
