
Python HTMLParser: UnicodeDecodeError

I'm using HTMLParser to parse pages I pull down with urllib, and I get UnicodeDecodeError exceptions when passing some of them to HTMLParser.

I tried using chardet to detect the encoding and then converting to ASCII or UTF-8 (the docs don't seem to say which one the parser expects). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed().

The information is there if I just print it out.

from HTMLParser import HTMLParser
import urllib
import chardet

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            page = urllib.urlopen(query + search).read()
            try:
                self.feed(page)
            except UnicodeDecodeError:
                encoding = chardet.detect(page)['encoding']
                if encoding != 'unicode':
                    page = page.decode(encoding)
                    page = page.encode('ascii', 'ignore')
                self.feed(page)
                print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

Here's the output:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    results = search_youtube(searches)
  File "test.py", line 23, in __init__
    self.feed(page)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Nona Urbiz asked Jan 25 '11 04:01




2 Answers

It is UTF-8, indeed. This works:

from HTMLParser import HTMLParser
import urllib

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
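            connection = urllib.urlopen(query + search)
            # Read the charset declared in the Content-Type response header;
            # fall back to UTF-8 if the server did not declare one.
            encoding = connection.headers.getparam('charset') or 'utf-8'
            # Decode the bytes to unicode before handing them to the parser.
            page = connection.read().decode(encoding)
            self.feed(page)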
            print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

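You don't need chardet: YouTube sends the correct encoding in the charset parameter of the Content-Type header. Decoding the page to unicode before calling feed() avoids the error, which most likely comes from the parser mixing your non-ASCII byte string with the unicode text it produces when unescaping entities, so that Python 2 attempts an implicit ASCII decode.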
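The header-based decoding step can be tried without a network call. The following is a minimal sketch, not part of the original answer: the helper names charset_from_content_type and decode_page are hypothetical, and the UTF-8 fallback and the 'replace' error handler are assumptions, chosen because the question says lossiness is acceptable. It uses only the standard email.message module, so it should behave the same on Python 2 and Python 3.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def decode_page(raw, content_type, fallback='utf-8'):
    """Decode raw page bytes with the declared charset, else the fallback.

    Hypothetical helper: the fallback encoding is an assumption.
    """
    encoding = charset_from_content_type(content_type) or fallback
    try:
        # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
        return raw.decode(encoding, 'replace')
    except LookupError:
        # The server named an encoding that Python does not know.
        return raw.decode(fallback, 'replace')


# Bytes as they might arrive from the server, plus the header value.
page = decode_page(b'caf\xc3\xa9 &amp; bar', 'text/html; charset=UTF-8')
```

The resulting page is text rather than bytes, so it can be passed to feed() without the parser having to guess an encoding.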

Lennart Regebro answered Sep 20 '22 17:09



What encoding does chardet say it is?

Please explain "The information is there if I just print it out": what is "it"? If you can read it and it makes sense when you print it to your console, then it must be in the usual/default encoding for your system; what is that? What operating system? What locale?

Can you give us a typical URL to make a query so that we can inspect for ourselves what you are seeing?

At one place in your code, you decode your output, then immediately smash it by using .encode('ascii', 'ignore'); why?
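To make the last point concrete, here is a small sketch of what the decode-then-encode pair does to non-ASCII text. The sample bytes are an assumption for illustration (the UTF-8 spelling of the band name with accents), not data taken from the actual page.

```python
# UTF-8 bytes for the accented spelling of the band name (illustrative input).
raw = b't\xc3\xa9l\xc3\xa9popmusik'

# Decoding yields proper text: 12 characters, two of them accented.
text = raw.decode('utf-8')

# Encoding to ASCII with 'ignore' silently drops every non-ASCII character.
lossy = text.encode('ascii', 'ignore')

# 'replace' at least leaves a visible marker where a character was lost.
replaced = text.encode('ascii', 'replace')
```

After this, lossy holds b'tlpopmusik' and replaced holds b't?l?popmusik': the decoded text was fine, and the encode step is what throws information away.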

John Machin answered Sep 21 '22 17:09
