I'm trying to convert ASCII characters to UTF-8. The little example below still reports ASCII:
import chardet
chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print chardet.detect(chunk[0:2000])
It returns:
{'confidence': 1.0, 'encoding': 'ascii'}
How come?
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
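As a small illustration of that last point (a sketch written for Python 3, where byte strings are `bytes` literals; the sample strings are made up), re-encoding pure ASCII bytes from ISO 8859-1 to UTF-8 leaves them unchanged, while a byte above 0x7F becomes a two-byte sequence:

```python
# Pure ASCII input: the ISO 8859-1 -> UTF-8 conversion changes nothing at byte level.
ascii_bytes = b"plain text"
ascii_converted = ascii_bytes.decode("ISO-8859-1").encode("UTF-8")
print(ascii_converted == ascii_bytes)  # True

# Input containing 0xE9 (e-acute in ISO 8859-1): the bytes do change.
latin1_bytes = b"caf\xe9"
latin1_converted = latin1_bytes.decode("ISO-8859-1").encode("UTF-8")
print(latin1_converted)  # b'caf\xc3\xa9'
```

So if the input chunk only ever contains bytes below 0x80, the converted chunk is byte-for-byte the same data.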
For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes, though most Western European characters require only 2 bytes.
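To make the byte counts concrete, here is a minimal Python 3 sketch. The sample characters are my own picks, one for each encoded length from 1 to 4 bytes:

```python
# Sample characters chosen for illustration: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```

Only the first sample is ASCII, and it is the only one that stays a single byte.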
The first 128 characters of Unicode match those of ASCII, and UTF-8 encodes these 128 characters as the same bytes as ASCII. As a result, a UTF-8 decoder can read an ASCII-encoded text file without issue.
UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98% of all web pages, and up to 100.0% for some languages, as of 2022.
Quoting from Python's documentation:
UTF-8 has several convenient properties:
It can handle any Unicode code point.
A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
A string of ASCII text is also a valid UTF-8 text.
In other words, every ASCII text is also a valid UTF-8 text (UTF-8 is a superset of ASCII).
To make it clear, check out this console session:
>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>
However, not every UTF-8 encoded string is a valid ASCII string:
>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii') #This won't work, since it's invalid in ASCII encoding
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>
So chardet is still right. Only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded.
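The byte-level property behind this can be sketched as follows. This is not chardet's actual implementation, only an illustration of what makes data indistinguishable from ASCII; `is_pure_ascii` is a hypothetical helper name.

```python
def is_pure_ascii(data):
    """Return True if every byte in `data` is below 0x80."""
    # bytearray yields integers when iterated, whatever the input bytes type.
    return all(b < 0x80 for b in bytearray(data))

print(is_pure_ascii(b"plain text"))   # True
print(is_pure_ascii(b"caf\xc3\xa9"))  # False
```

When this returns True for your chunk, the UTF-8 output of the conversion is identical to the ASCII input, so a detector has nothing to tell the two apart.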
Hope this simple explanation helps!