<p>I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table:</p> <pre class="prettyprint"><code><html> <table> <tr><td>Header</td></tr> <tr><td>Want This</td></tr> </table> </html> </code></pre> <p>so lets try it:</p> <pre class="prettyprint"><code>>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml') >>> doc <lxml.etree._ElementTree object at 0x1a1c290> </code></pre> <p>that looks good, lets see what else we have:</p> <pre class="prettyprint"><code>>>> root = doc.getroot() >>> print(lxml.etree.tostring(root)) <html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html> </code></pre> <p>LOL WUT?</p> <p>seriously. I was planning on using some xpath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.</p>

<p>Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?</p> <p>Here is a way to do this with lxml:</p> <pre class="prettyprint"><code>from lxml import html tree = html.fromstring(text) [td.text for td in tree.xpath("//td")] </code></pre> <p>Result:</p> <pre class="prettyprint"><code>['Header', 'Want This'] </code></pre>

<p>What you want to use is the <code>namespaceHTMLElements</code> argument, which for some reason defaults to True.</p> <pre class="prettyprint"><code>doc = html5lib.parse('''<html> <table> <tr><td>Header</td></tr> <tr><td>Want This</td></tr> </table> </html> ''', treebuilder='lxml', namespaceHTMLElements=False) print lxml.html.tostring(doc) </code></pre> <p>It's probably still easier to use lxml.html however.</p>

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

Tags:

python

parsing

xpath

lxml

html5lib

I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

so lets try it:

>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

that looks good, lets see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

seriously. I was planning on using some xpath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

297

asked Apr 01 '10 04:04

Dan.StackOverflow

2 Answers

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

from lxml import html
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]

Result:

['Header', 'Want This']

154

answered Sep 23 '22 13:09

Ryan Ginstrom

What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True.

doc = html5lib.parse('''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
''', treebuilder='lxml', namespaceHTMLElements=False)

print lxml.html.tostring(doc)

It's probably still easier to use lxml.html however.

answered Sep 21 '22 13:09

sciyoshi

Related questions
                            
                                Python: can unittest display expected and actual values?
                            
                                filtering dropdown values in django admin
                            
                                itertools.groupby in a django template
                            
                                Monitoring Rsync Progress
                            
                                Python: Pinpointing the Linear Part of a Slope [closed]
                            
                                Query whether Python's threading.Lock is locked or not
                            
                                python module import - relative paths issue
                            
                                python version 3.4 does not support a 'ur' prefix
                            
                                Appropriate choice of authentication class for python REST API used by web app
                            
                                How do I make Python3 the default Python in Geany
                            
                                Change Tkinter Frame Title [duplicate]
                            
                                Django Localhost CORS not working
                            
                                dill vs cPickle speed difference
                            
                                using import inside class
                            
                                Pandas round is not working for DataFrame
                            
                                Python Gmail API 'not JSON serializable'
                            
                                Tensorflow Deep MNIST: Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28]
                            
                                How to get parent folder name of current directory?
                            
                                How to remove special characters except space from a file in python?
                            
                                Install PyTorch from requirements.txt

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With