Parsing UTF-8/unicode strings with lxml HTML

I have been trying, without success, to parse UTF-8 encoded text with etree.HTML().

→ python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> import requests
>>> headers = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.8.0) Presto/2.12.363 Version/12.50"}
>>> r = requests.get("http://www.rakuten.co.jp/", headers=headers)
>>> r.status_code
200
>>> r.headers
{'x-cache': 'MISS from www.rakuten.co.jp', 'transfer-encoding': 'chunked', 'set-cookie': 'wPzd=lng%3DNA%3Acnt%3DCA; expires=Tue, 13-Aug-2013 16:51:38 GMT; path=/; domain=www.rakuten.co.jp', 'server': 'Apache', 'pragma': 'no-cache', 'cache-control': 'private', 'date': 'Mon, 13 Aug 2012 16:51:38 GMT', 'content-type': 'text/html; charset=EUC-JP'}
>>> responsetext = r.text

So far so good. The response text is fine, and it is a unicode string. Getting the list of CSS URIs from it works without issue either.

>>> tree = etree.HTML(responsetext)
>>> csspathlist = tree.xpath('//link[@rel="stylesheet"]/@href')
>>> csspathlist
['http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/common.css?v=1207111500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/layout.css?v=1207111500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/sidecolumn.css?v=1207111500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/api.css?v=1207111500',
 '/com/inc/home/20080930/beta/css/liquid/myrakuten_dpgs.css',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/leftcolumn.css?v=1207111500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/header.css?v=1207111500',
 '/com/inc/home/20080930/opt/css/normal/footer.css',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/ipad.css',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/genre.css?v=1207111500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/supersale.css?v=1207111500',
 '/com/inc/home/20080930/beta/css/liquid/rakuten_membership.css',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/noscript/set.css?v=1207111500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/suggest-2.0.1.css?v=1204231500',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/liquid_banner.css?v=1203011138',
 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/area_announce.css?v=1203011138']

Now let's encode the unicode string as UTF-8 and request the list of CSS URIs again.

>>> htmltext = responsetext.encode('utf-8')
>>> tree2 = etree.HTML(htmltext)
>>> csspathlist2 = tree2.xpath('//link[@rel="stylesheet"]/@href')
>>> csspathlist2
[]

I get an empty list.

>>> etree.tostring(tree2)
'<html lang="ja" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/><meta http-equiv="Content-Style-Type" content="text/css"/><meta http-equiv="Content-Script-Type" content="text/javascript"/><title/></head></html>'

Indeed, the second parse stopped at the first Japanese character in the title, which is why the title element comes out empty. The corresponding part of the source is:

<meta http-equiv="Content-Script-Type" content="text/javascript"/>
<title>
【楽天市場】Shopping is Entertainment! ： インターネット最大級の通信販売、通販オンラインショッピングコミュニティ
</title>

I'm still trying to understand what I have done wrong.

asked Aug 13 '12 17:08

karlcow

1 Answer

OK, I just found the answer. Writing the question up on Stack Overflow often helps.

etree.HTML() tries to guess the encoding from the meta element in the document:

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>
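The effect of trusting that declaration can be illustrated without lxml. The following is only a sketch using Python's built-in codecs (written for Python 3, with illustrative variable names); it mimics the mismatch rather than reproducing what the parser does internally. Bytes that really are EUC-JP decode cleanly under the declared charset, while the UTF-8 bytes of the same title are not valid EUC-JP.

```python
# A short piece of the page title, as a unicode string.
title = u"【楽天市場】"

euc_bytes = title.encode("euc_jp")   # bytes in the encoding the meta element declares
utf8_bytes = title.encode("utf-8")   # bytes after a manual conversion to UTF-8

# Genuine EUC-JP bytes round-trip under the declared charset.
round_trip = euc_bytes.decode("euc_jp")

# UTF-8 bytes interpreted as EUC-JP are rejected by the codec.
try:
    utf8_bytes.decode("euc_jp")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(round_trip == title, mismatch_detected)
```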

In this case, I had manually converted the document to UTF-8, so it was no longer in the Japanese encoding, EUC-JP, that the meta element declares. Solving the issue is just a matter of forcing the HTML parser to treat the input as UTF-8. In our case the code becomes:

>>> myparser = etree.HTMLParser(encoding="utf-8")
>>> tree = etree.HTML(htmltext, parser=myparser)
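For anyone who wants to try this without hitting the network, here is a minimal self-contained sketch. The small HTML document and the `/css/common.css` path are made up for illustration (they stand in for `r.text`), and the code assumes Python 3 with lxml installed.

```python
from lxml import etree

# Hypothetical stand-in for r.text: a unicode string whose meta element
# still declares EUC-JP, as in the original page.
html_text = (
    u'<html><head>'
    u'<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>'
    u'<title>【楽天市場】Shopping is Entertainment!</title>'
    u'<link rel="stylesheet" href="/css/common.css"/>'
    u'</head><body></body></html>'
)

# Manual conversion to UTF-8 bytes; the meta declaration is now wrong.
html_bytes = html_text.encode("utf-8")

# Tell the parser explicitly that the bytes are UTF-8, instead of letting it
# rely on the encoding declared in the document.
utf8_parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(html_bytes, parser=utf8_parser)

css_paths = tree.xpath('//link[@rel="stylesheet"]/@href')
title = tree.findtext(".//title")

print(css_paths)
print(title)
```

With the explicit parser, both the stylesheet link and the Japanese title should survive parsing.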

answered Oct 03 '22 20:10

karlcow
