<p>I'm using the <code>lxml.html</code> library to parse an HTML document.</p> <p>I located a specific tag, that I call <code>content_tag</code>, and I want to change its content (i.e. the text between <code><div></code> and <code></div></code>,) and the new content is a string with some html in it, say it's <code>'Hello <b>world!</b>'</code>.</p> <p>How do I do that? I tried <code>content_tag.text = 'Hello <b>world!</b>'</code> but then it escapes all the html tags, replacing <code><</code> with <code>&lt;</code> etc.</p> <p>I want to inject the text <em>without</em> escaping any HTML. How can I do that?</p>

<p>This is one way: </p> <pre class="prettyprint"><code>#!/usr/bin/env python2.6 from lxml.html import fromstring, tostring from lxml.html import builder as E fragment = """\ <div id="outer"> <div id="inner">This is div.</div> </div>""" div = fromstring(fragment) print tostring(div) # <div id="outer"> # <div id="inner">This is div.</div> # </div> div.replace(div.get_element_by_id('inner'), E.DIV('Hello ', E.B('world!'))) print tostring(div) # <div id="outer"> # <div>Hello <b>world!</b></div></div> </code></pre> <p>See also: http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory</p> <p><strong>Edit:</strong> So, I should have confessed earlier that I'm not all that familiar with lxml. I looked at the docs and source briefly, but didn't find a clean solution. Perhaps, someone more familiar will stop by and set us both straight. </p> <p>In the meantime, this seems to work, but is not well tested: </p> <pre class="prettyprint"><code>import lxml.html content_tag = lxml.html.fromstring('<div>Goodbye.</div>') content_tag.text = '' # assumes only text to start for elem in lxml.html.fragments_fromstring('Hello <b>world!</b>'): if type(elem) == str: #but, only the first? content_tag.text += elem else: content_tag.append(elem) print lxml.html.tostring(content_tag) </code></pre> <p><strong>Edit again:</strong> and this version removes text and children</p> <pre class="prettyprint"><code>somehtml = 'Hello <b>world!</b>' # purge element contents content_tag.text = '' for child in content_tag.getchildren(): content_tag.remove(child) fragments = lxml.html.fragments_fromstring(somehtml) if type(fragments[0]) == str: content_tag.text = fragments.pop(0) content_tag.extend(fragments) </code></pre>

Python: Injecting HTML content into a tag using `lxml.html`

Q: Can lxml parse HTML?

lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).

Q: What does HTML Fromstring do?

fromstring . This provides us with an object of HtmlElement type. This object has the xpath method which we can use to query the HTML document. This provides us with a structured way to extract information from an HTML document.

Tags:

python

html

parsing

lxml

I'm using the lxml.html library to parse an HTML document.

I located a specific tag, that I call content_tag, and I want to change its content (i.e. the text between <div> and </div>,) and the new content is a string with some html in it, say it's 'Hello <b>world!</b>'.

How do I do that? I tried content_tag.text = 'Hello <b>world!</b>' but then it escapes all the html tags, replacing < with < etc.

I want to inject the text without escaping any HTML. How can I do that?

642

asked Aug 11 '11 18:08

Ram Rachum

1 Answers

This is one way:

#!/usr/bin/env python2.6
from lxml.html import fromstring, tostring
from lxml.html import builder as E
fragment = """\
<div id="outer">
  <div id="inner">This is div.</div>
</div>"""

div = fromstring(fragment)
print tostring(div)
# <div id="outer">
#   <div id="inner">This is div.</div>
# </div>
div.replace(div.get_element_by_id('inner'), E.DIV('Hello ', E.B('world!')))
print tostring(div)
# <div id="outer">
#   <div>Hello <b>world!</b></div></div>

See also: http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory

Edit: So, I should have confessed earlier that I'm not all that familiar with lxml. I looked at the docs and source briefly, but didn't find a clean solution. Perhaps, someone more familiar will stop by and set us both straight.

In the meantime, this seems to work, but is not well tested:

import lxml.html
content_tag = lxml.html.fromstring('<div>Goodbye.</div>')
content_tag.text = '' # assumes only text to start
for elem in lxml.html.fragments_fromstring('Hello <b>world!</b>'):
    if type(elem) == str: #but, only the first?
        content_tag.text += elem
    else:
        content_tag.append(elem)
print lxml.html.tostring(content_tag)

Edit again: and this version removes text and children

somehtml = 'Hello <b>world!</b>'
# purge element contents
content_tag.text = ''
for child in content_tag.getchildren():
    content_tag.remove(child)

fragments = lxml.html.fragments_fromstring(somehtml)
if type(fragments[0]) == str:
    content_tag.text = fragments.pop(0)
content_tag.extend(fragments)

143

answered Oct 15 '22 22:10

Marty

Related questions
                            
                                Where can I find a good tutorial for py2exe? [closed]
                            
                                Is there a Python equivalent to Java's AWT Robot class? [closed]
                            
                                Python C API: how to get string representation of exception?
                            
                                Python with matplotlib - reusing drawing functions
                            
                                Neural Network Example Source-code (preferably Python) [closed]
                            
                                Loading SQL dump before running Django tests
                            
                                Match multiline regex in file object
                            
                                Python regex to convert non-ascii characters in a string to closest ascii equivalents
                            
                                Python parse date format, ignore parts of string
                            
                                Python and d-bus: How to set up main loop?
                            
                                python library for splitting video
                            
                                Are there any radix/patricia/critbit trees for Python?
                            
                                Cythonize a Python function to make it faster
                            
                                lxml parser eats all memory
                            
                                Why is this shell script calling itself as python script?
                            
                                Vectorize over the rows of an array
                            
                                Easiest Way to Transfer Data Over the Internet, Python
                            
                                Working with WTForms FieldList
                            
                                Compiling vim with Python3 (installed via Homebrew) support?
                            
                                Loading document as raw string in yaml with PyYAML

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With