
Parsing UTF-8/unicode strings with lxml HTML


I have been trying to parse UTF-8 encoded text with etree.HTML(), without success.

    → python
    Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> import requests
    >>> headers = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.8.0) Presto/2.12.363 Version/12.50"}
    >>> r = requests.get("http://www.rakuten.co.jp/", headers=headers)
    >>> r.status_code
    200
    >>> r.headers
    {'x-cache': 'MISS from www.rakuten.co.jp', 'transfer-encoding': 'chunked', 'set-cookie': 'wPzd=lng%3DNA%3Acnt%3DCA; expires=Tue, 13-Aug-2013 16:51:38 GMT; path=/; domain=www.rakuten.co.jp', 'server': 'Apache', 'pragma': 'no-cache', 'cache-control': 'private', 'date': 'Mon, 13 Aug 2012 16:51:38 GMT', 'content-type': 'text/html; charset=EUC-JP'}
    >>> responsetext = r.text

So far so good. The response text looks right and it is a unicode string. Getting the list of CSS URIs from it works without issue either.

    >>> tree = etree.HTML(responsetext)
    >>> csspathlist = tree.xpath('//link[@rel="stylesheet"]/@href')
    >>> csspathlist
    ['http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/common.css?v=1207111500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/layout.css?v=1207111500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/sidecolumn.css?v=1207111500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/api.css?v=1207111500',
     '/com/inc/home/20080930/beta/css/liquid/myrakuten_dpgs.css',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/leftcolumn.css?v=1207111500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/header.css?v=1207111500',
     '/com/inc/home/20080930/opt/css/normal/footer.css',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/ipad.css',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/genre.css?v=1207111500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/supersale.css?v=1207111500',
     '/com/inc/home/20080930/beta/css/liquid/rakuten_membership.css',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/noscript/set.css?v=1207111500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/suggest-2.0.1.css?v=1204231500',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/liquid_banner.css?v=1203011138',
     'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/area_announce.css?v=1203011138']

Now let's encode the unicode string as UTF-8 and request the list of CSS URIs again.

    >>> htmltext = responsetext.encode('utf-8')
    >>> tree2 = etree.HTML(htmltext)
    >>> csspathlist2 = tree2.xpath('//link[@rel="stylesheet"]/@href')
    >>> csspathlist2
    []

I get an empty list.

    >>> etree.tostring(tree2)
    '<html lang="ja" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/><meta http-equiv="Content-Style-Type" content="text/css"/><meta http-equiv="Content-Script-Type" content="text/javascript"/><title/></head></html>'

Indeed, the second parse stopped as soon as it reached the first Japanese character in the title:

    <meta http-equiv="Content-Script-Type" content="text/javascript"/>
    <title>
    【楽天市場】Shopping is Entertainment! : インターネット最大級の通信販売、通販オンラインショッピングコミュニティ
    </title>

I'm still trying to understand what I have done wrong.

karlcow asked Aug 13 '12 17:08





1 Answer

OK, I just found it. Writing the question out on StackOverflow often helps.

etree.HTML() tries to guess the encoding from the meta element in the document:

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/> 
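To see why the declared charset matters, here is a minimal sketch that uses only Python's built-in codecs, not lxml. The helper name decodes_as and the short title string are made up for the illustration. It checks whether the same text, once encoded as UTF-8, can still be read back as EUC-JP:

```python
# Stand-in for the start of the page title, written with escapes so the
# snippet does not depend on the source file encoding: 【楽天市場】
title = u'\u3010\u697d\u5929\u5e02\u5834\u3011'

utf8_bytes = title.encode('utf-8')     # what responsetext.encode('utf-8') produces
eucjp_bytes = title.encode('euc_jp')   # what the server originally sent


def decodes_as(data, encoding):
    """Hypothetical helper: True if `data` decodes cleanly with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(eucjp_bytes, 'euc_jp'))  # bytes match the declared charset
print(decodes_as(utf8_bytes, 'utf-8'))    # bytes match their real encoding
print(decodes_as(utf8_bytes, 'euc_jp'))   # UTF-8 bytes read as EUC-JP
```

The last check is the situation in the question: the bytes are UTF-8 while the document still declares EUC-JP, so a decoder that trusts the declaration hits invalid byte sequences at the first Japanese character.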

In this case, I had manually converted the document to UTF-8, so its bytes were no longer in EUC-JP, the Japanese encoding that the meta element declares. Solving the issue is just a matter of forcing the HTML parser to read the input as UTF-8. In our case the code becomes:

    >>> myparser = etree.HTMLParser(encoding="utf-8")
    >>> tree = etree.HTML(htmltext, parser=myparser)
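
For anyone who wants to try this without fetching the live page, here is a small self-contained sketch of the same fix. The inline HTML document and the /css/common.css path are invented stand-ins for the real response; only etree.HTMLParser(encoding=...) and etree.HTML(..., parser=...) come from the code above.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page: the meta element still
# declares EUC-JP, but the bytes handed to the parser are UTF-8.
html_unicode = (
    u'<html><head>'
    u'<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>'
    u'<title>\u697d\u5929\u5e02\u5834</title>'  # 楽天市場
    u'<link rel="stylesheet" href="/css/common.css"/>'
    u'</head><body><p>Shopping</p></body></html>'
)
html_utf8 = html_unicode.encode('utf-8')

# Tell the parser explicitly which encoding the bytes are in, instead of
# letting it rely on the charset declared in the document.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.HTML(html_utf8, parser=parser)

css_paths = tree.xpath('//link[@rel="stylesheet"]/@href')
title = tree.findtext('.//title')

print(css_paths)
print(title == u'\u697d\u5929\u5e02\u5834')
```

With the encoding given explicitly, the title and the stylesheet link after it should both be present in the parsed tree.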
karlcow answered Oct 03 '22 20:10
