 

Why does chardet say my UTF-8-encoded string (originally decoded from ISO-8859-1) is ASCII?

I'm trying to convert ASCII characters to UTF-8. The small example below still reports the result as ASCII:

import chardet

chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print chardet.detect(chunk[0:2000])

It returns:

{'confidence': 1.0, 'encoding': 'ascii'}

How come?

asked Oct 29 '13 08:10 by user809829


People also ask

Is ISO 8859 the same as UTF-8?

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.

Are UTF-8 and ASCII the same?

For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes (the original UTF-8 design allowed up to 6, but the current standard caps it at 4), though most Western European characters require only 2 bytes.
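The byte counts mentioned above can be checked directly. This is a minimal sketch in Python 3 syntax (the rest of this page uses Python 2); the sample characters and the names `samples` and `byte_lengths` are chosen only for illustration.

```python
# Python 3 sketch: how many bytes UTF-8 uses for characters in different ranges.
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent (Western European)
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X -> %d byte(s): %r" % (ord(ch), len(encoded), encoded))

# Collected lengths, one per sample character.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```

Only the first sample stays a single byte, and that byte is identical to its ASCII encoding.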

Can UTF-8 decode ASCII?

The first 128 characters in the Unicode character set match those in the ASCII character set, and UTF-8 encodes these 128 characters as the same bytes that ASCII uses. As a result, a UTF-8 decoder can read an ASCII-encoded text file without issue.

Is UTF-8 the default encoding?

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98% of all web pages, and up to 100.0% for some languages, as of 2022.


1 Answer

Quoting from Python's documentation:

UTF-8 has several convenient properties:

  1. It can handle any Unicode code point.

  2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.

  3. A string of ASCII text is also a valid UTF-8 text.

In other words, every ASCII text is also valid UTF-8 text (UTF-8 is a superset of ASCII).

To make it clear, check out this console session:

>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>> 

However, not every UTF-8-encoded string is a valid ASCII string:

>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii')  # This fails: these characters cannot be represented in ASCII

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> 

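So chardet is still right. If the bytes you pass it are all in the ASCII range, decoding them as ISO-8859-1 and re-encoding them as UTF-8 produces exactly the same bytes, so the result is valid ASCII and valid UTF-8 at the same time, and chardet reports the narrower of the two. Only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded.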
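To see this without chardet, here is a minimal sketch in Python 3 syntax (the sessions above are Python 2). The helper names `reencode_latin1_to_utf8` and `is_pure_ascii` and the sample byte strings are made up for illustration; `is_pure_ascii` is only a stand-in for the ASCII case, not a description of how chardet works internally.

```python
# Python 3 sketch; helper names and sample data are hypothetical.

def reencode_latin1_to_utf8(raw: bytes) -> bytes:
    """Decode ISO-8859-1 bytes and re-encode the text as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")

def is_pure_ascii(data: bytes) -> bool:
    """True when every byte is below 0x80, i.e. the data is valid ASCII."""
    return all(b < 0x80 for b in data)

ascii_only = b"plain text, nothing accented"
with_accent = b"caf\xe9"              # 0xE9 is e-acute in ISO-8859-1
late_accent = b"a" * 3000 + b"\xe9"   # non-ASCII byte only after offset 2000

converted_plain = reencode_latin1_to_utf8(ascii_only)
converted_accent = reencode_latin1_to_utf8(with_accent)
converted_late = reencode_latin1_to_utf8(late_accent)

print(converted_plain == ascii_only)         # True: the bytes did not change
print(is_pure_ascii(converted_plain))        # True
print(converted_accent)                      # b'caf\xc3\xa9'
print(is_pure_ascii(converted_accent))       # False
print(is_pure_ascii(converted_late[:2000]))  # True: the slice holds only ASCII
print(is_pure_ascii(converted_late))         # False
```

The last two lines matter for the code in the question: because only `chunk[0:2000]` is inspected, any non-ASCII characters that appear later in the chunk are never seen, and the slice on its own is still plain ASCII.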

Hope this simple explanation helps!

answered Oct 05 '22 02:10 by aIKid