Is Python 3.3 better than 2.7 for Decoding and Re-Encoding Scraped Web Text to UTF-8?? Like, a lot better?

Tags: python, python-3.x, encoding, unicode, python-2.7

There are seemingly a million questions involving Python Unicode Errors where the ...ordinal [is] not in range(128). Seemingly, the vast majority involve Python 2.x.

I know about these errors because I am currently in encoding, decoding hell. For a side-project, I scrape web pages and attempt to normalize that text data, so that it doesn't appear on our site with crazy characters. To normalize the data, I rely on HTMLParser's HTMLParser() and entitydefs, as well as decoding the text from whatever its original form was (string.decode('[original encoding]', 'ignore')) and encoding it as UTF-8 (string.encode('utf-8', 'ignore')).

Yet, seemingly, there's always a site on which my best efforts fail, raising the same old UnicodeError: ASCII decoding error...ordinal not in range(128). It's so annoying.

I've read (here and here) that in Python 3 all text is Unicode. While I've read a lot about Unicode, because I'm not a software engineer, I don't know whether Unicode is objectively better (i.e., lower failure rate) than 2.x's default ascii encoding option. I have to think anything would be better, but I'd like if someone more expert and experienced could lend some perspective.

I'd like to know whether I should migrate to Python 3 for its (improved) processing of text scraped from the web. I am hoping that someone here can explain (or suggest resources that explain) the pros and cons of Python 3's approach to text processing. Is it better?? Is there someone who's dealt with my same problem who's already migrated to Python 3?? Would he/she recommend that I start using Python 3, if the 2to3 migration weren't an issue??

Thank you in advance for any assistance. I sure need it.


asked Dec 12 '13 21:12

Bee Smears

1 Answer

I'll speak from the point of view of a Python 2.7 user.

It's true that Python 3 introduces some big changes in how Unicode is handled. I won't say it is easier to work with encodings in Python 3, but its model is indeed more reasonable for doing i18n stuff.
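To make those changes concrete, here is a minimal sketch written for Python 3 (the function names are purely illustrative, not part of any library). In Python 2, mixing a byte `str` with a `unicode` object triggers an implicit ASCII decode, which is where most `ordinal not in range(128)` errors come from. Python 3 refuses to mix `bytes` and `str` at all, and the ASCII failure only shows up if you ask for an ASCII decode explicitly.

```python
# Bytes as they might arrive from the network: UTF-8 for "café".
raw = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'


def mixing_error(data):
    """Return the name of the exception raised when bytes meet str."""
    try:
        data + " au lait"  # bytes + str: no implicit decoding in Python 3
    except TypeError as exc:
        return type(exc).__name__
    return None


def ascii_decode_reason(data):
    """Decode explicitly as ASCII and return the failure reason, if any."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


print(mixing_error(raw))
print(ascii_decode_reason(raw))
print(raw.decode("utf-8"))
```

The point of the design is that the error moves from "somewhere deep in string handling" to the exact line where bytes and text are combined, which tends to make it easier to find.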

Like I said, I use Python 2.7 and so far I've been able to handle every encoding problem I've found. You just have to understand what's going on under the hood and have a reasonable background in what encodings are all about. Joel Spolsky's article on the absolute minimum every developer must know about Unicode and character sets is the best introduction to encodings that I know of.

In that article, Joel says something that you need to keep in mind every time you find yourself in an encoding situation:

It does not make sense to have a string without knowing what encoding it uses.

Having said that, my suggestion to approach your problem with Python 2.7 would be something like this:

1. Read Joel's article, of course (it's a great read and takes 30 minutes or less).
2. Figure out what encoding the web page is using (you can find this in the response headers or in a field in BeautifulSoup).
3. .decode() the retrieved string using the encoding you figured out.
4. Once you decode, you don't have a str object anymore; you have a unicode object.
5. unicode is just an internal representation, not a real encoding, so if you want to output the content somewhere, you'll have to .encode() it. I suggest you use utf-8, of course.
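The steps above can be sketched roughly as follows. This is written for Python 3, where `bytes` and `str` play the roles of 2.7's `str` and `unicode`. The helper names and the sample header are assumptions for illustration; the only library call relied on is the standard library's `email.message.Message.get_content_charset()`, used here to pull the `charset` parameter out of a `Content-Type` header value.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption, not a guarantee) when the
    header does not declare a charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def normalize_to_utf8(raw_bytes, content_type):
    """Decode with the declared encoding, then re-encode as UTF-8."""
    encoding = charset_from_content_type(content_type)
    text = raw_bytes.decode(encoding)  # bytes -> text (unicode)
    return text.encode("utf-8")        # text -> UTF-8 bytes for output


# Hypothetical response: Latin-1 bytes for "café" with a matching header.
body = b"caf\xe9"
header = "text/html; charset=ISO-8859-1"
print(normalize_to_utf8(body, header))
```

Note that this sketch trusts the header completely: if the declared charset is wrong, `decode()` either raises `UnicodeDecodeError` or silently produces garbage, and an unknown charset name raises `LookupError`.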

Now, some points have to be understood. Maybe the web page you're scraping is not encoding-aware: it says it uses some encoding but doesn't stick to it. This is an error made by the webmaster, but you still have to do something to deal with it. You have three choices:

1. Pass 'ignore' as the error handler so that problematic characters are quietly dropped.
2. Use one of the good Python libraries that try to figure out what encoding a string is using. They are usually accurate but, of course, not a silver bullet. They can fail to guess, especially when the encoded data is malformed.
3. Get angry and drop the project ;) (I really don't recommend this one)
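As a rough illustration of the first two choices, here is a Python 3 sketch using only the standard library. The fallback function is a deliberately simplistic stand-in for a real detection library, not an equivalent of one: it just tries a hypothetical list of candidate encodings in order. The function names and the candidate list are assumptions made for this example.

```python
def decode_lenient(raw_bytes, declared, errors="ignore"):
    """Choice 1: decode with an error handler instead of raising.

    'ignore' silently drops undecodable bytes; 'replace' substitutes
    U+FFFD so the data loss stays visible.
    """
    return raw_bytes.decode(declared, errors)


def decode_with_fallbacks(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Crude stand-in for choice 2: try candidate encodings in order.

    Returns (text, encoding_used). 'latin-1' accepts every byte, so with
    the default candidates the final fallback line is never reached.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("utf-8", "replace"), "utf-8"


mislabeled = b"caf\xe9"  # Latin-1/cp1252 bytes on a page claiming UTF-8
print(repr(decode_lenient(mislabeled, "utf-8")))
print(repr(decode_lenient(mislabeled, "utf-8", "replace")))
print(decode_with_fallbacks(mislabeled))
```

Trying strict UTF-8 first is a common heuristic because arbitrary non-UTF-8 text rarely happens to be valid UTF-8, whereas single-byte encodings accept almost anything and so tell you very little when they succeed.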

To get encodings right, some amount of discipline is needed from both the source and the client. You have to write your program correctly, but you also need the encoding the source declares to match the encoding it actually uses.
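A small Python 3 sketch of what a mismatch between declared and real encoding looks like in practice (the sample strings and the helper name are made up for illustration). Depending on the direction of the mismatch, you get either a loud error or silently corrupted text.

```python
def decodes_cleanly(raw_bytes, declared):
    """Return True if the bytes decode under the declared encoding."""
    try:
        raw_bytes.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


# Case 1: page is really UTF-8 but declares Latin-1.
# Decoding never fails, it just produces mojibake.
utf8_bytes = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")    # 'cafÃ©' (two characters for one)

# If you know which wrong decoding was applied, it can be reversed.
repaired = mojibake.encode("latin-1").decode("utf-8")

# Case 2: page is really Latin-1 but declares UTF-8.
# Strict decoding fails loudly instead.
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

print(mojibake, repaired, decodes_cleanly(latin1_bytes, "utf-8"))
```

The first case is the more dangerous one for a scraper, since a successful decode says nothing about whether the declared encoding was the right one.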

Python 3 improves its unicode handling, but if you don't understand what is going on, that probably won't help you. The best thing you can do is understand encodings (it ain't that hard; again, read Joel!). Once you understand them, you'll be able to process text with Python 2.7, Python 3.3 and even PHP ;)

Hope this helps!


answered Sep 25 '22 05:09

Paulo Bu
