
Beautiful Soup raises UnicodeEncodeError "ordinal not in range(128)"

I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.

Since Beautiful Soup is not supposed to choke on bad markup, I wonder why it gives me these hiccups when part of a document is malformed, and whether there is a way to make it skip ahead to the next readable portion of the document regardless of the error.

The line where the error occurred is the 3rd one:

from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)

The full CLI output is:

Traceback (most recent call last):
  File "./grablinks", line 101, in <module>
    sys.exit(main())
  File "./grablinks", line 88, in main
    links = grab_links(options)
  File "./grablinks", line 36, in grab_links
    doc = doc_parser(reader)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
Tzury Bar Yochay asked Nov 04 '22 08:11



1 Answer

Yeah, it will choke if you have elements with non-ASCII names (<café>). And that's not even "bad markup" for XML.

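It's a bug in sgmllib, which BeautifulSoup uses. sgmllib looks for custom handler methods named after each tag (the traceback shows getattr(self, 'end_' + tag)), but in Python 2 attribute names are byte strings. Looking up a name that contains a non-ASCII character therefore raises UnicodeEncodeError rather than the AttributeError that sgmllib expects, even though such a method could never exist anyway.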

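You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except (AttributeError, UnicodeError): but that's not really a good fix. Note the parentheses: in Python 2, except AttributeError, UnicodeError: would catch only AttributeError and bind the exception to the name UnicodeError. It's not trivial to override the rest of the method either.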
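To make the idea behind that patch concrete, here is a minimal sketch of a handler lookup that tolerates non-ASCII tag names. It is not sgmllib's actual code: find_handler and DemoParser are hypothetical names introduced only for illustration.

```python
def find_handler(parser, prefix, tag):
    """Return the method named prefix + tag on parser, or None if absent."""
    try:
        return getattr(parser, prefix + tag)
    except (AttributeError, UnicodeError):
        # Python 2 raises UnicodeEncodeError for a non-ASCII unicode name,
        # Python 3 raises AttributeError. Either way, no handler exists.
        return None


class DemoParser(object):
    """Hypothetical stand-in for an SGMLParser subclass."""

    def end_p(self):
        return "closed p"


parser = DemoParser()
known = find_handler(parser, "end_", u"p")          # bound method end_p
unknown = find_handler(parser, "end_", u"caf\xe9")  # no such handler
```

Catching both exception types in one tuple is what the parenthesized except clause in the patch achieves; the sketch just isolates that lookup so it can be tried outside the library.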

What is it you're trying to parse? BeautifulStoneSoup was always of questionable usefulness, really: XML doesn't have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn't XML. Consequently you should generally use a plain old XML parser (e.g. a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
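As a rough illustration of the "plain old XML parser" route, the sketch below uses the standard library's xml.etree.ElementTree on a small made-up document with a non-ASCII element name. SAMPLE, element_names and try_parse are assumptions for the example, not anything from the question's script.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: UTF-8 bytes containing a non-ASCII element name.
SAMPLE = u"<menu><caf\xe9>espresso</caf\xe9><tea>green</tea></menu>".encode("utf-8")


def element_names(xml_bytes):
    """Return the tag name of every element, in document order."""
    root = ET.fromstring(xml_bytes)
    return [element.tag for element in root.iter()]


def try_parse(xml_bytes):
    """Return the parsed root element, or None if the input is not well-formed."""
    try:
        return ET.fromstring(xml_bytes)
    except ET.ParseError:
        return None


names = element_names(SAMPLE)
broken = try_parse(b"<a><b></a>")  # mismatched tags, so not XML at all
```

A strict parser like this handles the non-ASCII name without trouble, but it rejects a malformed document outright instead of skipping ahead, which is the trade-off described above.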

bobince answered Nov 09 '22 05:11
