Dealing with multiple charsets in Python 3

Question

I'm using Python 3.3.0 on Windows 8.

    import urllib.request

    requrl = urllib.request.Request(url)
    response = urllib.request.urlopen(requrl)
    source = response.read()
    source = source.decode('utf-8')

This works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or some other charset? I may have different website URLs with different charsets. So, how do I deal with multiple charsets?

Here is what I tried in order to resolve this issue:

    b1 = b'charset=iso-8859-1'
    b1 = b1.decode('iso-8859-1')

    if b1 in source:
        source = source.decode('iso-8859-1')

It gave me this error: TypeError: Type str doesn't support the buffer API. So I'm assuming that it's treating b1 as a string, and this is not the correct way! :(

Please don't tell me to change the charset manually in the source code, or ask whether I have read the Python docs! I have already tried to dig into the Python 3 docs but still have no luck, or I may not be picking the correct modules/contents to read!

Francis Avila · Accepted Answer

In Python 3, a str is actually a sequence of Unicode characters (equivalent to the u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).

The reason your b1 in source fails is that you are trying to find a Unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1 = b1.decode('iso-8859-1'), it should work because you are then comparing two byte sequences.
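A minimal sketch of that difference, using a made-up HTML fragment in place of the real response body (the `source` bytes and the variable names here are illustrative, not taken from the question):

```python
# Stand-in for the bytes returned by response.read() (hypothetical content).
source = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
needle = b'charset=iso-8859-1'

# bytes in bytes: a valid comparison.
found = needle in source

# str in bytes: raises TypeError, which is the error from the question.
try:
    needle.decode('iso-8859-1') in source
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode only once the check has been done on the raw bytes.
text = source.decode('iso-8859-1') if found else source.decode('utf-8')
```

Note that searching the raw bytes for one specific charset string only covers that one case; it is not a general detection method.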

Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode the bytes to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response. (See the rules below.) However, so many websites declare the wrong encoding in the header that we have had to develop other, complicated encoding-sniffing rules for HTML. Please read those rules so you realize what a difficult problem this is!
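As a rough illustration of examining the Content-Type header, the standard library's `email.message.Message.get_content_charset()` can pull out the charset parameter. The helper name `charset_from_content_type` is hypothetical. Since the headers object on a `urllib` response is an `http.client.HTTPMessage` (a `Message` subclass), `response.headers.get_content_charset()` should behave the same way on a live response.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value,
    or `default` when the header declares none."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(failobj=default)

declared = charset_from_content_type('text/html; charset=ISO-8859-1')
missing = charset_from_content_type('text/html')
```

This only reports what the server claims; as noted above, the claim is often wrong.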

I recommend that you do one of the following:

1. Use the requests library instead of urllib, because it automatically takes care of most Unicode conversions properly. (It's also much easier to use.)
2. If conversion to Unicode at that layer fails, pass the bytes directly to the underlying library you are using (e.g. lxml or html5lib) and let it deal with determining the encoding. These libraries often implement the right charset-sniffing algorithm for the document type.

If neither of these works, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
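For illustration only, here is a standard-library fallback chain. It is not chardet and does no statistical detection; it simply tries a declared charset, then utf-8, then cp1252, and finally latin-1, which accepts any byte sequence. The function name and the order of candidates are assumptions of this sketch.

```python
def decode_with_fallbacks(data, declared=None):
    """Try candidate encodings in order; return (text, encoding_used)."""
    candidates = [declared] if declared else []
    candidates += ['utf-8', 'cp1252']
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    # latin-1 maps every byte to a character, so this cannot fail,
    # but the result may be garbage for mixed-encoding documents.
    return data.decode('latin-1'), 'latin-1'

utf8_result = decode_with_fallbacks('café'.encode('utf-8'))
cp1252_result = decode_with_fallbacks('café'.encode('cp1252'))
```

A successful decode does not prove the guess was right: many byte sequences decode without error under several encodings.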

Here are the rules for interpreting the charset declared in a content-type header.

With no explicit charset declared:
1. text/* (e.g. text/html) is in ASCII.
2. application/* (e.g. application/json, application/xhtml+xml) is UTF-8.

With an explicit charset declared:
1. If the type is text/html and the charset is iso-8859-1, it's actually win-1252 (== CP1252).
2. Otherwise, use the charset declared.
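The rules listed above could be written down roughly as follows. This is a sketch of those four rules only; the function name `effective_charset` is hypothetical, and returning `None` for media types the rules do not mention is a choice made here.

```python
def effective_charset(content_type, declared_charset=None):
    """Map a media type plus an optional declared charset to a Python codec name."""
    # Keep only the media type, dropping any parameters after ';'.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if declared_charset:
        charset = declared_charset.strip().lower()
        if media_type == 'text/html' and charset == 'iso-8859-1':
            return 'cp1252'  # explicit rule 1: treat as Windows-1252
        return charset       # explicit rule 2: trust the declaration
    if media_type.startswith('text/'):
        return 'ascii'       # implicit rule 1
    if media_type.startswith('application/'):
        return 'utf-8'       # implicit rule 2
    return None              # not covered by the rules above

html_latin1 = effective_charset('text/html', 'ISO-8859-1')
json_default = effective_charset('application/json')
```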

(Note that the HTML5 spec willfully violates the W3C specs by looking for UTF-8 and UTF-16 byte order marks in preference to the Content-Type header. Please read that encoding detection algorithm and see why we can't have nice things...)
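A byte order mark check of that kind can be sketched with the `codecs` constants. This covers only the BOM step, not the full sniffing algorithm, and the helper name `sniff_bom` is hypothetical. It does not distinguish UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

def sniff_bom(data):
    """Return a codec name if `data` starts with a UTF-8 or UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'  # this codec strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'     # this codec reads the BOM to pick the byte order
    return None

utf8_bytes = codecs.BOM_UTF8 + 'hi'.encode('utf-8')
utf16_bytes = 'hi'.encode('utf-16')  # Python prepends a BOM for 'utf-16'
plain_bytes = b'<html></html>'
```

When `sniff_bom` returns a codec name, `data.decode(name)` yields the text without the BOM; when it returns `None`, fall back to the Content-Type header or other sniffing.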

Tags: python, character-encoding, python-3.3

magneto
