I have an HTML file:
<html>
<p>somestr
<sup>1</sup>
anotherstr
</p>
</html>
I would like to extract the text as:
somestr1anotherstr
but I can't figure out how to do it. I have written a to_sup()
function that converts numeric strings to superscript so the closest I get is something like:
for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,
but ElementStringResult doesn't seem to have a method to get the tag name, so I am a bit lost. Any ideas how to solve it?
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection. Which parser works better depends very much on the input. The downside of using the BeautifulSoup parser is that it is much slower than lxml's own HTML parser.
lxml is a Pythonic binding for two C libraries, libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API.
first solution (concatenates text with no separator - see also python [lxml] - cleaning out html tags):
import lxml.html
# html_string is assumed to hold the HTML source shown in the question
document = lxml.html.document_fromstring(html_string)
# internally does: etree.XPath("string()")(document)
print(document.text_content())
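Note that text_content() keeps the line breaks between the tags, so on the HTML from the question it does not print somestr1anotherstr directly. A minimal sketch of one way to collapse the whitespace, assuming Python 3 and assuming that no whitespace inside the text needs to survive (the html_string and compact names are only illustrative):

```python
import lxml.html

# The HTML from the question, including its line breaks.
html_string = "<html>\n<p>somestr\n<sup>1</sup>\nanotherstr\n</p>\n</html>"
document = lxml.html.document_fromstring(html_string)

raw = document.text_content()

# str.split() with no arguments splits on every run of whitespace,
# so joining the pieces with "" drops all spaces and newlines.
compact = "".join(raw.split())
print(compact)  # somestr1anotherstr
```

This also removes spaces between ordinary words, so it only fits input like the example, where the pieces are meant to be glued together.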
this one helped me - concatenation the way I needed:
from lxml import etree
# document is the tree parsed in the first solution; prints one text node per line
print("\n".join(etree.XPath("//text()")(document)))
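Neither snippet above tells you which tag a piece of text came from, which is what the question's loop needs. As far as I know, the strings lxml returns from an XPath text() query are "smart strings" that carry a getparent() method plus is_text and is_tail flags, so the tag can be looked up through the parent. A sketch under those assumptions, for Python 3; the to_sup() here is only a hypothetical stand-in for the asker's function, and extract_text() is an illustrative name:

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(): maps ASCII digits
# to Unicode superscript digits.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(text):
    return text.translate(SUPERSCRIPT_DIGITS)


def extract_text(doc):
    parts = []
    for node in doc.xpath(".//p//text()"):
        parent = node.getparent()
        # For the text that FOLLOWS </sup>, getparent() is also the <sup>
        # element, but is_text is False (it is the tail), so check both.
        if node.is_text and parent.tag == "sup":
            parts.append(to_sup(node.strip()))
        else:
            parts.append(node.strip())
    return "".join(parts)


html_string = "<html>\n<p>somestr\n<sup>1</sup>\nanotherstr\n</p>\n</html>"
doc = lxml.html.document_fromstring(html_string)
result = extract_text(doc)
print(result)  # somestr¹anotherstr
```

The is_text check matters: without it, "anotherstr" would be treated as superscript too, because lxml stores it as the tail of the sup element.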