I have the following code:
import unicodedata
my_var = "this is a string"
my_var2 = " Esta es una oración que está en español "
my_var3 = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore')
output = my_var + my_var3
print(output)
Python exits with the following error:
File "C:/path/to/my/file/testing_file.py", line 5, in <module>
    output = my_var + my_var3
TypeError: Can't convert 'bytes' object to str implicitly

Process finished with exit code 1
I would like to know: what does this code do? This logic comes from another developer's project and I can't understand it at all.
How can I solve this problem? I need a string that I can manipulate afterwards.
In Python 3, string.encode() creates a bytes object, which cannot be mixed with a regular string. You have to convert the result back to a string; the method is, predictably, called decode().
my_var3 = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore').decode('ascii')
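Putting it together, a minimal sketch of the corrected snippet (this just applies the decode() fix to the code from the question):

import unicodedata

my_var = "this is a string"
my_var2 = " Esta es una oración que está en español "
# encode() produces bytes; decode() turns the result back into a str
my_var3 = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore').decode('ascii')
output = my_var + my_var3
print(output)  # this is a string Esta es una oracion que esta en espanol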
In Python 2, there was no hard distinction between Unicode strings and "regular" (byte) strings, but that meant many hard-to-catch bugs were introduced when programmers made careless assumptions about the encoding of the strings they were manipulating.
As for what the normalization does, it makes sure characters which look identical actually are identical. For example, ñ can be represented either as the single code point U+00F1 LATIN SMALL LETTER N WITH TILDE or as the sequence U+006E LATIN SMALL LETTER N followed by U+0303 COMBINING TILDE. Normalization coerces every variation into the same representation (the D forms prefer the decomposed, combining sequence), so that strings which represent the same text are also guaranteed to contain exactly the same code points.
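A quick sketch of that in the interpreter (the variable names here are just illustrative):

import unicodedata

composed = "\u00F1"     # 'ñ' as the single precomposed code point U+00F1
decomposed = "n\u0303"  # 'n' followed by U+0303 COMBINING TILDE

print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize('NFKD', composed) == unicodedata.normalize('NFKD', decomposed))  # True: both decomposed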
Because decomposed characters in many Latin-based languages are often a plain ASCII character followed by one or more combining diacritics (which are not ASCII characters), converting the string to 7-bit ASCII with the 'ignore' error handler will often strip the accents but leave the text almost readable: Götterdämmerung gets converted to Gotterdammerung, etc.
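A small sketch of that accent-stripping effect, using the Götterdämmerung example from above:

import unicodedata

text = "Götterdämmerung"
# NFKD splits 'ö' into 'o' + a combining diaeresis; encoding to ASCII with
# 'ignore' then drops the combining marks
stripped = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
print(stripped)  # Gotterdammerung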
You need to specify the source file's encoding. Then you need to use unicode literals instead of plain strings as the arguments to normalize():
# -*- coding: utf-8 -*-
import unicodedata
my_var = u"this is a string"
my_var2 = u" Esta es una oración que está en español "
# normalize, drop the non-ASCII bytes, then decode back to a (unicode) string
my_var3 = unicodedata.normalize(u'NFKD', my_var2).encode('ascii', 'ignore').decode('utf8')
output = my_var + my_var3
print(output)