There are seemingly a million questions involving Python Unicode errors of the "...ordinal [is] not in range(128)" kind. Seemingly, the vast majority involve Python 2.x.
I know about these errors because I am currently in encoding, decoding hell. For a side-project, I scrape web pages and attempt to normalize that text data, so that it doesn't appear on our site with crazy characters. To normalize the data, I rely on HTMLParser's HTMLParser() and entitydefs, as well as decoding the text from whatever its original form was (string.decode('[original encoding]', 'ignore')) and encoding it as UTF-8 (string.encode('utf-8', 'ignore')).
Yet, seemingly, there's always a site on which my best efforts fail, raising the same old UnicodeError: ASCII decoding error...ordinal not in range(128).
It's so annoying.
I've read (here and here) that in Python 3 all text is Unicode. While I've read a lot about Unicode, because I'm not a software engineer, I don't know whether Unicode is objectively better (i.e., lower failure rate) than 2.x's default ascii encoding option. I have to think anything would be better, but I'd like if someone more expert and experienced could lend some perspective.
I'd like to know whether I should migrate to Python 3 for its (improved) processing of text scraped from the web. I am hoping that someone here can explain (or suggest resources that explain) the pros and cons of Python 3's approach to text processing. Is it better?? Is there someone who's dealt with my same problem who's already migrated to Python 3?? Would he/she recommend that I start using Python 3, if the 2to3 migration weren't an issue??
Thank you in advance for any assistance. I sure need it.
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
UTF-8 stores each ASCII character in a single byte and needs two to four bytes only for other characters, such as accented letters or non-Latin scripts. It is the most popular form of encoding, and is by default the encoding in Python 3. In Python 2, the default encoding is ASCII (unfortunately).
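As a rough illustration of the byte counts, here is a minimal sketch. The four sample characters are arbitrary picks of mine, not something from the thread.

```python
# Sample characters chosen for illustration only: A, e-acute, euro sign, an emoji.
samples = [u"A", u"\u00e9", u"\u20ac", u"\U0001F600"]

# Encode each character on its own and count the resulting bytes.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(utf8_lengths)  # [1, 2, 3, 4]
```

Plain ASCII text therefore costs nothing extra in UTF-8, which is part of why it is such a common choice for web pages.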
Python's bytes decode() method converts bytes to a string object, and the string encode() method does the reverse. Both these methods allow us to specify the error handling scheme to use for encoding/decoding errors. The default is 'strict', meaning that errors raise a UnicodeError subclass: UnicodeEncodeError when encoding, UnicodeDecodeError when decoding.
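To make the error handling schemes concrete, here is a small sketch that decodes and encodes with the ascii codec on purpose, so the non-ASCII data triggers each scheme. The sample bytes are made up for the example.

```python
raw = b"caf\xc3\xa9"  # UTF-8 bytes for "cafe" with an accented e

# 'strict' (the default): a byte the codec cannot handle raises an error.
try:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The same thing in the other direction raises UnicodeEncodeError.
try:
    u"caf\u00e9".encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

ignored = raw.decode("ascii", "ignore")    # undecodable bytes are dropped
replaced = raw.decode("ascii", "replace")  # each bad byte becomes U+FFFD

print(decode_error)   # message ends with "ordinal not in range(128)"
print(ignored)        # caf
print(len(replaced))  # 5
```

Note that 'ignore' and 'replace' hide the problem rather than solve it: the accented character is lost either way, because ascii was the wrong codec for these bytes.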
Popular encodings include utf-8 and ascii. Using the string encode() method, you can convert unicode strings into any encoding supported by Python. By default, Python 3 uses utf-8 encoding.
I'll speak from the point of view of a Python 2.7 user.
It's true that Python 3 introduces some big changes in the Unicode field. I won't say it is easier to work with encodings in Python 3, but it's indeed more reasonable for doing i18n stuff.
Like I said, I use Python 2.7 and so far I've been able to handle every encoding problem I've found. You just have to understand what's going on under the hood, and have a reasonable background in what encodings are all about, of course: this is the best article there is to understand encodings.
In that article, Joel says something that you need to keep in mind every time you find yourself in an encoding situation:
It does not make sense to have a string without knowing what encoding it uses.
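A quick way to see what that quote means in practice: the same bytes turn into different text depending on which encoding you assume. This is only a sketch, and the sample bytes are invented for the example.

```python
raw = b"caf\xc3\xa9"  # bytes as they might arrive from a web server

as_utf8 = raw.decode("utf-8")      # the intended text, ending in an accented e
as_latin1 = raw.decode("latin-1")  # no error at all, just the wrong text

print(as_utf8 == u"caf\u00e9")          # True
print(as_latin1 == u"caf\u00c3\u00a9")  # True: two junk characters instead of one
print(len(as_utf8), len(as_latin1))     # 4 5
```

The latin-1 case is the dangerous one, since nothing fails and the "crazy characters" only show up later on the page.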
Having said that, my suggestion to approach your problem with Python 2.7 would be something like this:
1. Figure out which encoding the web page is using (you can sense this by looking at the response headers or in a field in BeautifulSoup).
2. .decode() the retrieved string using the encoding you figured out.
3. After decode, you don't have a str object anymore, you have a unicode object.
4. unicode is just an internal representation, not a real encoding, so if you want to output the content somewhere, you'll have to .encode() it, and I suggest you use utf-8, of course.
Now, some points have to be understood. Maybe the web page you're scraping is not encoding aware and it says it uses some encoding but doesn't stick to it. This is an error made by the webmaster, but you have to do something to figure it out. You have three choices:
- ... ,ignore characters that can be problematic. Just quietly pass them.
- ... encoding is malformed.
To get encodings right, some amount of discipline is needed from the source and from the client. You have to develop your program right, but you also need the encoding declared at the source to match the real encoding.
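The decode-then-encode approach described above, including a fallback for pages that lie about their encoding, could be sketched roughly like this. The helper names charset_from_content_type and to_utf8 are hypothetical, and so are the header and body values. Falling back to a lenient UTF-8 decode is my own assumption about a sensible default, not something the answer prescribes. The code is written to run on Python 3 and should also work on 2.7.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def to_utf8(raw, declared_encoding, errors="replace"):
    """Decode raw bytes to text, then encode that text as UTF-8 bytes."""
    try:
        # Decode strictly first: bytes in, text (unicode) out.
        text = raw.decode(declared_encoding or "utf-8")
    except (UnicodeDecodeError, LookupError):
        # Declared encoding is wrong or unknown (assumed fallback):
        # decode as UTF-8 and mark bad bytes instead of raising.
        text = raw.decode("utf-8", errors)
    # Encode for output: text in, UTF-8 bytes out.
    return text.encode("utf-8")


header = "text/html; charset=ISO-8859-1"  # made-up response header
body = b"caf\xe9"                         # made-up body in ISO-8859-1

encoding = charset_from_content_type(header)
print(encoding)                                   # iso-8859-1
print(to_utf8(body, encoding) == b"caf\xc3\xa9")  # True
```

Passing errors="ignore" instead of the default "replace" drops the bad bytes silently, which matches the "just quietly pass them" option but makes broken pages harder to spot afterwards.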
Python 3 improves its unicode handling, but if you don't understand what is going on, it will probably be useless. The best thing you can do is understand encodings (ain't that hard, again, read Joel!) and once you understand them, you'll be able to process text with Python 2.7, Python 3.3 and even PHP ;)
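For what it's worth, one concrete difference between the two versions is what happens when bytes and text get mixed by accident. The sketch below uses made-up sample values. On Python 3 the mix is rejected with a TypeError, while Python 2 tries an implicit ASCII decode of the bytes, which is where many of the "ordinal not in range(128)" errors come from.

```python
raw = b"caf\xc3\xa9"  # bytes, e.g. a scraped response body
text = u" au lait"    # text

try:
    combined = raw + text  # mixing bytes and text
    outcome = "mixed silently"
except TypeError:
    # Python 3: bytes and str never mix implicitly.
    outcome = "TypeError"
except UnicodeDecodeError:
    # Python 2: raw is implicitly decoded as ASCII, which fails here.
    outcome = "UnicodeDecodeError"

print(outcome)

# Being explicit works the same way on both versions.
combined = raw.decode("utf-8") + text
print(combined == u"caf\u00e9 au lait")  # True
```

So Python 3 doesn't remove the need to know the encoding, but it does fail earlier and more clearly when a decode step was forgotten.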
Hope this helps!