Extract text with lxml.html

Tags: python, lxml
I have an HTML file:

<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>

I would like to extract the text as:

somestr¹anotherstr

but I can't figure out how to do it. I have written a to_sup() function that converts numeric strings to superscript, so the closest I have got is something like:

for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,

but ElementStringResult doesn't seem to have a method to get the tag name, so I am a bit lost. Any ideas on how to solve this?

asked Dec 17 '12 10:12

root

1 Answer

First solution (concatenates the text with no separator; see also "python [lxml] - cleaning out html tags"):

   import lxml.html

   # html_string holds the HTML source as a string
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())
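
As a rough sketch of what text_content() gives for the HTML in the question (Python 3 assumed; html_string, raw and normalized are illustrative names, not part of the original post): the markup is dropped but the whitespace of the source is kept, so collapsing it is left to the caller.

```python
import lxml.html

# The sample document from the question.
html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

document = lxml.html.document_fromstring(html_string)

# text_content() returns all text below the element, tags removed,
# with the original line breaks and indentation still in place.
raw = document.text_content()

# Collapse every run of whitespace into a single space.
normalized = " ".join(raw.split())
print(normalized)  # somestr 1 anotherstr
```

Note that the information about which text came from the sup element is lost at this point, so this alone cannot produce the superscript.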

This second one helped me, because it joins the text nodes with a separator of my choice:

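   from lxml import etree

   # document is the tree parsed in the first snippet
   # //text() selects every text node; join them with whatever separator you need
   print("\n".join(etree.XPath("//text()")(document)))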
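
For the superscript case in the question, one possible approach (a sketch, not tested beyond the sample document) relies on the fact that the strings lxml returns from //text() are "smart strings": each has getparent() plus is_text and is_tail flags, which is how the tag name can be recovered. Python 3 is assumed here; to_sup() below is only a stand-in for the asker's function, and text_with_superscripts() is a made-up name.

```python
import lxml.html

# Stand-in for the asker's to_sup(): maps ASCII digits to superscript digits.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(text):
    return text.translate(SUPERSCRIPTS)


def text_with_superscripts(html_string):
    document = lxml.html.document_fromstring(html_string)
    parts = []
    for node in document.xpath("//text()"):
        chunk = node.strip()
        if not chunk:
            continue  # skip whitespace-only nodes
        # For tail text, getparent() returns the element the text follows,
        # so check is_text to make sure the string really sits inside <sup>.
        if node.is_text and node.getparent().tag == "sup":
            chunk = to_sup(chunk)
        parts.append(chunk)
    return "".join(parts)


sample = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

print(text_with_superscripts(sample))  # somestr¹anotherstr
```

Stripping each chunk and joining with an empty string matches the output asked for in the question; if words from neighbouring elements should stay separated, join with a space instead.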

answered Oct 19 '22 16:10

Robert Lujo
