I'm using HTMLParser to parse pages I pull down with urllib, and I'm running into UnicodeDecodeError exceptions when passing some of them to HTMLParser.

I tried using chardet to detect the encoding and to convert to ASCII or UTF-8 (the docs don't seem to say what it should be). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed().

The information is there if I just print it out.

    from HTMLParser import HTMLParser
    import urllib
    import chardet

    class search_youtube(HTMLParser):

        def __init__(self, search_terms):
            HTMLParser.__init__(self)
            self.track_ids = []
            for search in search_terms:
                self.__in_result = False
                search = urllib.quote_plus(search)
                query = 'http://youtube.com/results?search_query='
                page = urllib.urlopen(query + search).read()
                try:
                    self.feed(page)
                except UnicodeDecodeError:
                    encoding = chardet.detect(page)['encoding']
                    if encoding != 'unicode':
                        page = page.decode(encoding)
                        page = page.encode('ascii', 'ignore')
                    self.feed(page)
                print 'success'

    searches = ['telepopmusik breathe']
    results = search_youtube(searches)
    print results.track_ids

Here's the output:

    Traceback (most recent call last):
      File "test.py", line 27, in <module>
        results = search_youtube(searches)
      File "test.py", line 23, in __init__
        self.feed(page)
      File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
        attrvalue = self.unescape(attrvalue)
      File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
      File "/usr/lib/python2.6/re.py", line 151, in sub
        return _compile(pattern, 0).sub(repl, string, count)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

It is UTF-8, indeed. This works:

    from HTMLParser import HTMLParser
    import urllib

    class search_youtube(HTMLParser):

        def __init__(self, search_terms):
            HTMLParser.__init__(self)
            self.track_ids = []
            for search in search_terms:
                self.__in_result = False
                search = urllib.quote_plus(search)
                query = 'http://youtube.com/results?search_query='
                connection = urllib.urlopen(query + search)
                encoding = connection.headers.getparam('charset')
                page = connection.read().decode(encoding)
                self.feed(page)
                print 'success'

    searches = ['telepopmusik breathe']
    results = search_youtube(searches)
    print results.track_ids

You don't need chardet. YouTube are not morons; they actually send the correct encoding in the header.
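For anyone on Python 3, here is a rough sketch of the same idea (decode the bytes with the charset from the Content-Type header, then feed text to the parser). It runs offline: the `headers` object and the HTML snippet are made-up stand-ins for a real response, and `TitleCollector`, `decode_page` and the UTF-8 fallback are names and choices I invented for illustration, not anything from the code above.

```python
from email.message import Message
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical parser: collects the title attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'title':
                    self.titles.append(value)


def decode_page(raw, headers, fallback='utf-8'):
    """Decode raw bytes with the charset declared in Content-Type.

    Falling back to UTF-8 when no charset is declared is an assumption
    made for this sketch.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors='replace')


# Stand-in for a real response: a header object plus a UTF-8 encoded body.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'
raw = ('<a href="/watch?v=abc" '
       'title="T\u00e9l\u00e9popmusik &amp; Breathe">x</a>').encode('utf-8')

parser = TitleCollector()
parser.feed(decode_page(raw, headers))  # feed text, never undecoded bytes
print(parser.titles)
```

With a live request, the object returned by `urllib.request.urlopen()` exposes a `headers` attribute that supports the same `get_content_charset()` call, so `decode_page(response.read(), response.headers)` should slot in the same way.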
What encoding does chardet say it is?
Please explain "The information is there if I just print it out": what is "it"? If you can read it and it makes sense when you print it to your console, then it must be in the usual/default encoding for your system; what is that? What operating system? What locale?
Can you give us a typical URL to make a query so that we can inspect for ourselves what you are seeing?
At one place in your code, you decode your output, then immediately smash it by using .encode('ascii', 'ignore'); why?
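To make that last point concrete, here is a small Python 3 sketch of what the decode-then-encode step does to non-ASCII text. The sample string is my own assumption, not taken from the actual page.

```python
# Sample text with two non-ASCII characters (an assumption for illustration).
text = 'T\u00e9l\u00e9popmusik'
as_utf8 = text.encode('utf-8')

# Decoding as ASCII fails on the first byte outside the 0-127 range.
try:
    as_utf8.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding correctly and then encoding with 'ignore' silently drops characters.
lossy = as_utf8.decode('utf-8').encode('ascii', 'ignore')

# Decoding correctly and keeping the text preserves everything.
kept = as_utf8.decode('utf-8')

print(ascii_error)
print(lossy)
print(kept == text)
```

The `'ignore'` handler throws away every character that has no ASCII equivalent, so the accented letters simply disappear; keeping the decoded text avoids that loss.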