
lxml and <noscript> in <head>

I ran into a strange bug with lxml:

>>> import lxml.html
>>> s = '<html><head><noscript></noscript><script></script><meta></head></html>'
>>> root = lxml.html.fromstring(s)
>>> root.xpath('/html/head/meta')
[]
>>> root.xpath('/html/body/meta')
[<Element meta at 0x2a92788>]

The meta tag should be in the head element, not in body. How can I get the element in its correct place in this situation?


2 Answers

Let me guess: are you using an old version of Ubuntu (such as 12.04)? This is a bug in the old preinstalled libxml2 library that the lxml package uses. The release notes for libxml2 2.8.0 mention a fix for an HTML parser error with <noscript> in the <head>, so I guess libxml2 >= 2.8.0 should work. Ubuntu 12.04 has version 2.7.8 installed.

lxml reports both the libxml2 version it was compiled against and the version loaded at runtime:

>>> import lxml.etree
>>> lxml.etree.LIBXML_COMPILED_VERSION
(2, 7, 8)
>>> lxml.etree.LIBXML_VERSION
(2, 9, 1)

I think that if either of these versions is >= 2.8.0, the <noscript> issue should be gone.
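
For what it's worth, the version check can be wrapped in a small helper. This is only a sketch: `has_noscript_fix` and `FIXED_IN` are names made up for illustration, the 2.8.0 threshold is taken from the release notes mentioned in this answer, and it assumes the runtime version (`LIBXML_VERSION`) is the one that decides how the parser behaves.

```python
# Sketch: decide whether the libxml2 in use is new enough for the <noscript> fix.
# FIXED_IN is the threshold quoted in this answer; the helper name is hypothetical.
FIXED_IN = (2, 8, 0)


def has_noscript_fix(libxml_version, fixed_in=FIXED_IN):
    """Return True if a (major, minor, patch) tuple is at or past fixed_in."""
    return tuple(libxml_version) >= tuple(fixed_in)


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml not installed; the helper still works on plain tuples

if etree is not None:
    print("compiled against:", etree.LIBXML_COMPILED_VERSION)
    print("running with:", etree.LIBXML_VERSION)
    print("noscript fix expected:", has_noscript_fix(etree.LIBXML_VERSION))
```

Python compares tuples element by element, so (2, 7, 8) falls below the threshold while (2, 9, 1) passes it.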

Palasaty answered Dec 11 '25 22:12


This works for me:

import lxml.html

s = '<html><head><noscript></noscript><script></script><meta></head></html>' 
root = lxml.html.fromstring(s)
print(root.xpath('/html/head/meta'))
print(root.xpath('/html/body/meta'))

Output:

[<Element meta at 0x10a123b8>]
[]

I'm using Python 2.7.9 and lxml version 3.4.2.

gtlambert answered Dec 11 '25 22:12


