Why does chardet say my UTF-8-encoded string (originally decoded from ISO-8859-1) is ASCII?

Tags: python, encoding, ascii, utf-8, decoding

I'm trying to convert ASCII characters to UTF-8. After the conversion in this little example, chardet still reports the result as ASCII:

chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print chardet.detect(chunk[0:2000])

It returns:

{'confidence': 1.0, 'encoding': 'ascii'}

How come?

371

asked Oct 29 '13 08:10

user809829

1 Answer

Quoting from Python's documentation:

UTF-8 has several convenient properties:

It can handle any Unicode code point.

A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.

A string of ASCII text is also a valid UTF-8 text.
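The three properties quoted above can be checked directly. This is a minimal sketch, assuming Python 3 (where `str` is Unicode and `encode` returns `bytes`); the sample strings are made up for illustration:

```python
# Assumes Python 3; the sample text is illustrative only.
# "\U0001F600" is a code point outside the Basic Multilingual Plane.
sample = "na\u00efve caf\u00e9 \U0001F600"

utf8_bytes = sample.encode("utf-8")

# Property 1: any code point can be encoded and decoded back losslessly.
round_trips = utf8_bytes.decode("utf-8") == sample

# Property 2: no zero bytes appear unless the text itself contains U+0000.
utf8_has_zero = 0 in utf8_bytes
# For contrast, UTF-16 pads ASCII-range characters with zero bytes.
utf16_has_zero = 0 in sample.encode("utf-16-le")

# Property 3: pure ASCII text produces identical bytes in both encodings.
ascii_same = "plain text".encode("ascii") == "plain text".encode("utf-8")

print(round_trips, utf8_has_zero, utf16_has_zero, ascii_same)
```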

Every ASCII text is also a valid UTF-8 text, because UTF-8 is a superset of ASCII.

To make it clear, check out this console session:

>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>

However, not every UTF-8-encoded string is a valid ASCII string:

>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii')  # This fails: these characters cannot be represented in ASCII

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>

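So chardet is still right. If your chunk contains only ASCII characters, decoding it as ISO-8859-1 and re-encoding it as UTF-8 produces exactly the same bytes, and those bytes are valid ASCII. Only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded.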
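To see why a detector reports ASCII here, it helps to compare the bytes before and after the conversion. This is a minimal sketch, assuming Python 3 and two made-up input chunks. The helper names are hypothetical, and it uses a plain byte-range check instead of `chardet` so that it runs without third-party packages:

```python
def reencode_latin1_to_utf8(chunk: bytes) -> bytes:
    """Decode ISO-8859-1 bytes and re-encode them as UTF-8 (hypothetical helper)."""
    return chunk.decode("ISO-8859-1").encode("UTF-8")


def is_pure_ascii(data: bytes) -> bool:
    """Return True when every byte is below 0x80, i.e. the data is valid ASCII."""
    return all(byte < 0x80 for byte in data)


ascii_chunk = b"hello world"   # only ASCII bytes
latin1_chunk = b"caf\xe9"      # 0xE9 is e-acute in ISO-8859-1

ascii_out = reencode_latin1_to_utf8(ascii_chunk)
latin1_out = reencode_latin1_to_utf8(latin1_chunk)

# The ASCII chunk comes out byte-for-byte unchanged.
print(ascii_out == ascii_chunk, is_pure_ascii(ascii_out))
# The Latin-1 chunk gains a two-byte UTF-8 sequence for e-acute.
print(latin1_out, is_pure_ascii(latin1_out))
```

With the first chunk the output is identical to the input, so a detector has nothing but ASCII bytes to look at. Only the second chunk yields bytes at or above 0x80, which is what distinguishes UTF-8 from plain ASCII.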

Hope this simple explanation helps!

114

answered Oct 05 '22 02:10

aIKid
