
Is Python 3.3 better than 2.7 for Decoding and Re-Encoding Scraped Web Text to UTF-8?? Like, a lot better?

There are seemingly a million questions involving Python Unicode Errors where the ...ordinal [is] not in range(128). Seemingly, the vast majority involve Python 2.x.

I know about these errors because I am currently in encoding, decoding hell. For a side-project, I scrape web pages and attempt to normalize that text data, so that it doesn't appear on our site with crazy characters. To normalize the data, I rely on HTMLParser's HTMLParser() and entitydefs, as well as decoding the text from whatever its original form was (string.decode('[original encoding]', 'ignore')) and encoding it as UTF-8 (string.encode('utf-8', 'ignore')).

Yet, seemingly, there's always a site on which my best efforts fail, raising the same old UnicodeError: ASCII decoding error...ordinal not in range(128). It's so annoying.

I've read (here and here) that in Python 3 all text is Unicode. While I've read a lot about Unicode, because I'm not a software engineer, I don't know whether Unicode is objectively better (i.e., lower failure rate) than 2.x's default ascii encoding option. I have to think anything would be better, but I'd like if someone more expert and experienced could lend some perspective.

I'd like to know whether I should migrate to Python 3 for its (improved) processing of text scraped from the web. I am hoping that someone here can explain (or suggest resources that explain) the pros and cons of Python 3's approach to text processing. Is it better? Is there someone who's dealt with my same problem who's already migrated to Python 3? Would they recommend that I start using Python 3, if the 2to3 migration weren't an issue?

Thank you in advance for any assistance. I sure need it.

Bee Smears asked Dec 12 '13 21:12





1 Answer

I'll speak from the point of view of a Python 2.7 user.

It's true that Python 3 introduces some big changes in how Unicode is handled. I won't say it's easier to work with encodings in Python 3, but its model is more reasonable for i18n work.

Like I said, I use Python 2.7, and so far I've been able to handle every encoding problem I've run into. You just have to understand what's going on under the hood and have a reasonable background in what encodings are all about. This is the best article there is for understanding encodings.

In that article, Joel says something you need to keep in mind every time you find yourself in an encoding situation:

It does not make sense to have a string without knowing what encoding it uses.
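A small illustration of that point, as a sketch rather than anything taken from the article: the same bytes produce different text depending on which encoding you assume. The byte values and variable names below are made up for the example.

```python
# The same two bytes, interpreted under two different encodings.
raw = b'\xc3\xa9'

as_utf8 = raw.decode('utf-8')      # one character: U+00E9 (e with acute accent)
as_latin1 = raw.decode('latin-1')  # two characters: U+00C3 followed by U+00A9

print(len(as_utf8), len(as_latin1))

# ASCII cannot represent these bytes at all, which is where the familiar
# "ordinal not in range(128)" message comes from.
try:
    raw.decode('ascii')
    ascii_reason = None
except UnicodeDecodeError as exc:
    ascii_reason = exc.reason
    print('ascii failed:', ascii_reason)
```

Nothing in the bytes themselves says which reading is correct; that information has to come from outside, for example from the page's headers.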

Having said that, my suggestion for approaching your problem with Python 2.7 would be something like this:

  1. Read Joel's article, of course (it's a great read and takes 30 minutes or less).
  2. Figure out what encoding the web page is using (you can find this in the response headers or in a field in BeautifulSoup).
  3. .decode() the retrieved string using the encoding you figured out.
  4. After decoding you no longer have a str object; you have a unicode object.
  5. unicode is just an internal representation, not a real encoding, so if you want to output the content somewhere, you'll have to .encode() it, and I suggest using utf-8.
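The steps above can be sketched roughly as follows. This is a minimal example written to run on Python 3, where the decoded object is a str rather than 2.7's unicode; the .decode() and .encode() calls themselves are the ones described above. The helper names (charset_from_content_type, to_utf8), the sample header and the sample body are hypothetical, and the sketch assumes the header's charset is actually correct.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def to_utf8(raw_bytes, content_type):
    """Decode raw bytes using the declared charset, then re-encode as UTF-8."""
    encoding = charset_from_content_type(content_type)
    text = raw_bytes.decode(encoding)  # bytes -> text
    return text.encode('utf-8')        # text -> UTF-8 bytes


# Hypothetical response: "cafe" with an acute accent, encoded as ISO-8859-1.
body = b'caf\xe9'
header = 'text/html; charset=ISO-8859-1'
print(to_utf8(body, header))
```

The decode here is strict on purpose: if the page lies about its encoding, you get an exception at the point where the wrong assumption was made instead of corrupted text later.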

Now, some points have to be understood. Maybe the web page you're scraping is not encoding-aware: it says it uses some encoding but doesn't stick to it. This is the webmaster's error, but you still have to deal with it. You have three choices:

  1. Pass 'ignore' when decoding so that problematic characters are quietly skipped.
  2. There are good Python libraries that try to guess which encoding a string uses. They are often accurate but not a silver bullet: they can guess wrong, especially when the data is malformed.
  3. Get angry and drop the project ;) (I really don't recommend this one)
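The first two choices can be combined into a fallback chain. The sketch below uses only the standard library and is one possible arrangement, not a recommendation from the answer: the function name decode_with_fallback, the candidate list and the sample bytes are all assumptions. A detection library (chardet is one example, named here as an assumption since the answer doesn't say which) could supply extra candidates.

```python
def decode_with_fallback(raw_bytes, declared, candidates=('utf-8', 'cp1252')):
    """Try the declared encoding first, then other candidates, and finally
    fall back to a lossy decode that replaces undecodable bytes."""
    for encoding in (declared,) + tuple(candidates):
        try:
            return raw_bytes.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid for this encoding.
            # LookupError: Python does not know this encoding name.
            continue
    # Last resort: keep what decodes, mark the rest with U+FFFD.
    return raw_bytes.decode('utf-8', 'replace'), 'utf-8 (lossy)'


# Hypothetical page that claims ASCII but actually contains UTF-8 bytes.
text, used = decode_with_fallback(b'na\xc3\xafve', 'ascii')
print(used)

# Choice 1 on its own: 'ignore' silently drops the two non-ASCII bytes.
dropped = b'na\xc3\xafve'.decode('ascii', 'ignore')
print(dropped)
```

Note that 'latin-1' is deliberately absent from the candidates: it accepts every byte sequence, so it never fails and would hide a wrong guess.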

Getting encodings right takes some discipline from both the source and the client. You have to write your program correctly, but you also need the encoding the source declares to match the encoding it actually uses.

Python 3 improves its Unicode handling, but if you don't understand what's going on, that probably won't help. The best thing you can do is understand encodings (it ain't that hard; again, read Joel!). Once you do, you'll be able to handle them in Python 2.7, Python 3.3 and even PHP ;)
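To make the difference concrete, here is a short sketch of what changes between the two versions when bytes and text get mixed. The variable names and sample data are invented for the example.

```python
raw = b'caf\xc3\xa9'  # bytes, as they might arrive from the network
label = u'Title: '    # text

try:
    combined = label + raw
    failure = None
except TypeError:
    # Python 3: mixing text and bytes is refused outright.
    failure = 'TypeError'
except UnicodeDecodeError:
    # Python 2: an implicit ASCII decode of `raw` is attempted and fails.
    failure = 'UnicodeDecodeError'

# Either way, the fix is the same explicit decode.
combined = label + raw.decode('utf-8')
print(failure, len(combined))
```

Python 3 fails at the point where the two types meet, whatever the data is. Python 2 fails only when the bytes happen to contain something outside ASCII, which is why the error seems to show up on some sites and not others.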

Hope this helps!

Paulo Bu answered Sep 25 '22 05:09