I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table:
<html>
<table>
<tr><td>Header</td></tr>
<tr><td>Want This</td></tr>
</table>
</html>
so lets try it:
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>
that looks good, lets see what else we have:
>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>
LOL WUT?
seriously. I was planning on using some xpath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.
If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following : (1) var div=document. createElement("DIV"); (2) div. innerHTML = markup; (3) result = div. childNodes; --- This gives you a collection of childnodes and should work not just in IE8 but even in IE6-7.
BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, such as tag, navigable string, or comment.
Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?
Here is a way to do this with lxml:
from lxml import html
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]
Result:
['Header', 'Want This']
What you want to use is the namespaceHTMLElements
argument, which for some reason defaults to True.
doc = html5lib.parse('''<html>
<table>
<tr><td>Header</td></tr>
<tr><td>Want This</td></tr>
</table>
</html>
''', treebuilder='lxml', namespaceHTMLElements=False)
print lxml.html.tostring(doc)
It's probably still easier to use lxml.html however.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With