
Extract text with lxml.html

Tags:

python

lxml

I have an HTML file:

<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>

I would like to extract the text as:

somestr1anotherstr

but I can't figure out how to do it. I have written a to_sup() function that converts numeric strings to superscript, so the closest I have got is something like:

for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,

but ElementStringResult doesn't seem to have a method to get the tag name, so I am a bit lost. Any ideas on how to solve this?

root asked Dec 17 '12 10:12




1 Answer

First solution (concatenates the text with no separator; see also "python [lxml] - cleaning out html tags"):

   import lxml.html
   # html_string is assumed to hold the HTML source shown in the question
   document = lxml.html.document_fromstring(html_string)
   # text_content() internally does: etree.XPath("string()")(document)
   print(document.text_content())
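
Note that text_content() keeps the newlines and indentation from the source, so the result is not yet the compact somestr1anotherstr asked for. A minimal sketch of one way to get there, assuming Python 3 and assuming that all whitespace can be thrown away (this would also remove spaces inside real sentences, so it only suits input like the sample). The name compact is just for illustration:

```python
import lxml.html

html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

document = lxml.html.document_fromstring(html_string)

# text_content() returns all text in document order, whitespace included
raw = document.text_content()

# split() with no arguments splits on any run of whitespace and drops
# empty pieces; joining with "" glues the remaining pieces together
compact = "".join(raw.split())
print(compact)
```

If the spaces between words matter, joining with " " instead of "" is probably the safer choice.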

This one helped me; it concatenates the text the way I needed:

   from lxml import etree
   # join every text node in the document, one per line
   print("\n".join(etree.XPath("//text()")(document)))
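
On the tag name part of the question: the strings that lxml returns from an XPath text() query are "smart strings" by default, and they carry getparent(), is_text and is_tail, which is enough to tell whether a piece of text sits inside a sup element. Below is a rough sketch under a few assumptions: Python 3, and a stand-in to_sup() that maps digits to Unicode superscripts (the asker's real to_sup() is not shown, so this one is hypothetical, as is the helper name extract_with_sup):

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(): digits -> superscripts
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(text):
    return text.translate(SUPERSCRIPT_DIGITS)


def extract_with_sup(html_string):
    document = lxml.html.document_fromstring(html_string)
    parts = []
    # //p//text() yields every text node under <p>, in document order
    for node in document.xpath("//p//text()"):
        stripped = node.strip()
        if not stripped:
            continue  # skip whitespace-only nodes
        parent = node.getparent()
        # Tail text that follows </sup> also reports <sup> as its parent,
        # so is_text is checked to convert only text inside <sup>
        if node.is_text and parent.tag == "sup":
            parts.append(to_sup(stripped))
        else:
            parts.append(stripped)
    return "".join(parts)


html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

result = extract_with_sup(html_string)
print(result)
```

The is_text check matters here: in lxml's tree model "anotherstr" is stored as the tail of the sup element, so checking the parent tag alone would convert it too.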
Robert Lujo answered Oct 19 '22 16:10