Python HTMLParser: UnicodeDecodeError

Tags: python, character-encoding, html-parsing

I'm using HTMLParser to parse pages I download with urllib, and I get UnicodeDecodeError exceptions when I pass some of them to the parser.

I tried using chardet to detect the encoding and then convert the page to ASCII or UTF-8 (the docs don't seem to say which one HTMLParser expects). Lossiness is acceptable, but although the decode/encode lines run without error, I always get the error from self.feed().

The information is there if I just print it out.

from HTMLParser import HTMLParser
import urllib
import chardet

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            page = urllib.urlopen(query + search).read()
            try:
                self.feed(page)
            except UnicodeDecodeError:
                encoding = chardet.detect(page)['encoding']
                if encoding != 'unicode':
                    page = page.decode(encoding)
                    page = page.encode('ascii', 'ignore')
                self.feed(page)
                print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

Here's the output:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    results = search_youtube(searches)
  File "test.py", line 23, in __init__
    self.feed(page)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

asked Jan 25 '11 04:01

Nona Urbiz

2 Answers

It is UTF-8, indeed. This works:

from HTMLParser import HTMLParser
import urllib

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
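            connection = urllib.urlopen(query + search)
            # Read the charset from the Content-Type header; fall back to
            # UTF-8 if the server does not declare one.
            encoding = connection.headers.getparam('charset') or 'utf-8'
            # Decode to unicode before feeding, so the parser never has to
            # mix byte strings with unicode entity replacements.
            page = connection.read().decode(encoding)
            self.feed(page)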
            print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

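You don't need chardet: YouTube are not morons, and they actually send the correct encoding in the Content-Type header. Decoding the page to unicode before calling feed() matters because HTMLParser.unescape() substitutes unicode characters for entities, and Python 2 then tries to decode any non-ASCII bytes around them as ASCII.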
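The decoding step can be separated from the network call. The following is only a minimal sketch, written to run on both Python 2 and Python 3; the helper names charset_from_content_type and decode_page are hypothetical, and the UTF-8 fallback and the 'replace' error handler are assumptions (chosen because the question says lossiness is acceptable), not something the answer prescribes.

```python
import re


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or `default`."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)',
                      content_type or '', re.I)
    return match.group(1) if match else default


def decode_page(raw_bytes, content_type):
    """Decode raw response bytes to unicode text using the declared charset."""
    encoding = charset_from_content_type(content_type)
    try:
        # 'replace' swaps undecodable bytes for U+FFFD instead of raising.
        return raw_bytes.decode(encoding, 'replace')
    except LookupError:
        # The server named a codec Python does not know: assume UTF-8.
        return raw_bytes.decode('utf-8', 'replace')


# Example with a hard-coded header value instead of a live response.
text = decode_page(b'T\xc3\xa9l\xc3\xa9popmusik', 'text/html; charset=utf-8')
```

In the loop above, the second argument would be the value of the response's Content-Type header, and the returned unicode string is what gets passed to feed().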

answered Sep 20 '22 17:09

Lennart Regebro

What encoding does chardet say it is?

Please explain "The information is there if I just print it out": what is "it"? If you can read it and it makes sense when you print it to your console, then it must be in the usual/default encoding for your system; what is that? What operating system? What locale?
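For anyone trying to answer those questions about their own machine, here is a small sketch that gathers the relevant settings. The function name describe_encodings is hypothetical; the calls themselves are standard library and behave the same way on Python 2 and 3.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses implicitly on this system."""
    return {
        'default': sys.getdefaultencoding(),         # used for implicit str/unicode mixing
        'locale': locale.getpreferredencoding(),     # derived from the OS locale
        'stdout': getattr(sys.stdout, 'encoding', None),  # may be None when piped
        'platform': sys.platform,
    }


for name, value in sorted(describe_encodings().items()):
    print('%s: %s' % (name, value))
```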

Can you give us a typical URL to make a query so that we can inspect for ourselves what you are seeing?

At one place in your code, you decode your output, then immediately smash it by using .encode('ascii', 'ignore'); why?
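To make the cost of that except branch concrete, the sketch below shows two things, under stated assumptions. First, .encode('ascii', 'ignore') silently deletes every non-ASCII character. Second, a parser whose feed() raised still holds the unparsed input, so retrying on the same instance appends the new text to the old bytes unless reset() is called first; this may explain why the retry in the question fails with the same error. FailingParser is a hypothetical class that simulates a failing feed(), and rawdata is an internal attribute inspected here for illustration only.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

# 1. 'ignore' drops characters that ASCII cannot represent.
title = u'T\xe9l\xe9popmusik'
ascii_only = title.encode('ascii', 'ignore')   # the accented letters are gone


class FailingParser(HTMLParser):
    """Hypothetical parser whose handler always raises, to simulate a failed feed()."""

    def handle_starttag(self, tag, attrs):
        raise ValueError('simulated failure inside feed()')


parser = FailingParser()
try:
    parser.feed(u'<a href="x">')
except ValueError:
    pass

# 2. The input that failed is still buffered inside the parser.
leftover = parser.rawdata

# reset() discards the buffered input, so a retry starts from a clean state.
parser.reset()
after_reset = parser.rawdata
```

Decoding once, up front, avoids both problems: nothing is thrown away, and no retry is needed.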

answered Sep 21 '22 17:09

John Machin
