
Google App Engine Python 2.7 + lxml = Unicode ParserError

I am trying to use BeautifulSoup v4 to parse a document. I call BeautifulSoup on note.content, which is a string returned by Evernote's API:

soup = BeautifulSoup(note.content)

I have enabled lxml in my app.yaml file:

libraries:
- name: lxml
  version: "2.3"

Note that this works on my local development server. However, when deployed to Google's cloud I get the following error:

Error Trace:

Unicode parsing is not supported on this platform
Traceback (most recent call last):
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
    return handler.dispatch()
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~ever-blog/1.356951374446096208/controller/blog.py", line 101, in get
    soup = BeautifulSoup(note.content)
  File "/base/data/home/apps/s~ever-blog/1.356951374446096208/lib/bs4/__init__.py", line 168, in __init__
    self._feed()
  File "/base/data/home/apps/s~ever-blog/1.356951374446096208/lib/bs4/__init__.py", line 181, in _feed
    self.builder.feed(self.markup)
  File "/base/data/home/apps/s~ever-blog/1.356951374446096208/lib/bs4/builder/_lxml.py", line 62, in feed
    self.parser.feed(markup)
  File "parser.pxi", line 1077, in lxml.etree._FeedParser.feed (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:76196)
ParserError: Unicode parsing is not supported on this platform

UPDATE:

I looked at parser.pxi and found the lines that raise the error:

        elif python.PyUnicode_Check(data):
            if _UNICODE_ENCODING is NULL:
                raise ParserError, \
                    u"Unicode parsing is not supported on this platform"

I think something about GAE's deployment environment must be causing this error, but I am not sure what.

UPDATE 2:

Because BeautifulSoup will automatically fall back on other parsers, I ended up removing lxml from my application entirely. Doing so fixed the problem.
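For anyone who would rather keep lxml where it works, here is a minimal sketch of choosing the parser explicitly instead of relying on the automatic fallback. The helper name pick_parser is hypothetical. The probe assumes that feeding a small unicode string to lxml's HTML parser fails in the same way as in the traceback above, which I have not verified on the production runtime.

```python
def pick_parser():
    """Return a BeautifulSoup parser name that accepts text input here.

    Hypothetical helper: it probes lxml with a tiny unicode string and
    falls back to the standard library parser if anything goes wrong.
    """
    try:
        from lxml import etree  # may be missing, or unable to parse unicode
        parser = etree.HTMLParser()
        parser.feed(u"<p>probe</p>")  # same kind of feed() call as in the traceback
        parser.close()
        return "lxml"
    except Exception:
        # Import failure or ParserError: use the pure-Python parser instead.
        return "html.parser"


# Intended use (assumes bs4 is importable in the app):
#   soup = BeautifulSoup(note.content, pick_parser())
```

Passing the parser name as the second argument keeps the choice visible in the handler, so a change in the platform's lxml build does not silently change which parser is used.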

zzz asked Feb 20 '12 14:02




2 Answers

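Try parsing a UTF-8 encoded byte string instead of a unicode object. The check in parser.pxi that raises the error only runs when the input is unicode.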
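A minimal sketch of that suggestion, assuming note.content may arrive either as a unicode object or as bytes. The helper name to_utf8_bytes is hypothetical and not part of BeautifulSoup or lxml.

```python
# unicode on Python 2, str on Python 3
TEXT_TYPE = type(u"")


def to_utf8_bytes(markup):
    """Return markup as a UTF-8 byte string, encoding it only if it is text."""
    if isinstance(markup, TEXT_TYPE):
        return markup.encode("utf-8")
    # Already a byte string: pass it through unchanged.
    return markup


# Intended use (assumes bs4 is importable in the app):
#   soup = BeautifulSoup(to_utf8_bytes(note.content), from_encoding="utf-8")
```

The from_encoding argument tells BeautifulSoup which encoding the bytes are in, so it does not have to guess.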

dogada answered Sep 18 '22 14:09


As of May 2012 this bug is still present in production, but not in the SDK (1.6.6).

However, rolling back to bs3 bypasses it on production for the time being:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

notreadbyhumans answered Sep 18 '22 14:09