
Python: Injecting HTML content into a tag using `lxml.html`

I'm using the lxml.html library to parse an HTML document.

I have located a specific tag, which I'll call content_tag, and I want to replace its content (i.e. the text between <div> and </div>). The new content is a string containing some HTML, say 'Hello <b>world!</b>'.

How do I do that? I tried content_tag.text = 'Hello <b>world!</b>', but that escapes all the HTML tags, replacing < with &lt; and so on.

I want to inject the text without escaping any HTML. How can I do that?
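
A minimal reproduction of the escaping described above, as a sketch assuming Python 3 and an illustrative one-element document (the '<div>Goodbye.</div>' input is a stand-in, not the real page):

```python
import lxml.html

# Stand-in for the real document: a single <div> holding plain text.
content_tag = lxml.html.fromstring('<div>Goodbye.</div>')

# Assigning to .text stores a literal string, so markup in it is escaped
# when the element is serialized.
content_tag.text = 'Hello <b>world!</b>'

result = lxml.html.tostring(content_tag, encoding='unicode')
print(result)
# <div>Hello &lt;b&gt;world!&lt;/b&gt;</div>
```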

Ram Rachum asked Aug 11 '11 18:08



1 Answer

This is one way:

#!/usr/bin/env python3
from lxml.html import fromstring, tostring
from lxml.html import builder as E

fragment = """\
<div id="outer">
  <div id="inner">This is div.</div>
</div>"""

div = fromstring(fragment)
print(tostring(div, encoding='unicode'))
# <div id="outer">
#   <div id="inner">This is div.</div>
# </div>

# Build the new element with the E-factory and swap it in for the old one.
# The replacement is a fresh element, so id="inner" is not carried over.
div.replace(div.get_element_by_id('inner'), E.DIV('Hello ', E.B('world!')))
print(tostring(div, encoding='unicode'))
# <div id="outer">
#   <div>Hello <b>world!</b></div></div>

See also: http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory

Edit: I should have confessed earlier that I'm not all that familiar with lxml. I looked briefly at the docs and source but didn't find a clean solution. Perhaps someone more familiar will stop by and set us both straight.

In the meantime, this seems to work, but is not well tested:

import lxml.html

content_tag = lxml.html.fromstring('<div>Goodbye.</div>')
content_tag.text = ''  # assumes the element contains only text to start with
for elem in lxml.html.fragments_fromstring('Hello <b>world!</b>'):
    # Only the first fragment can be a plain string (the leading text);
    # text that follows an element is stored as that element's .tail.
    if isinstance(elem, str):
        content_tag.text += elem
    else:
        content_tag.append(elem)
print(lxml.html.tostring(content_tag, encoding='unicode'))

Edit again: this version first removes any existing text and children:

somehtml = 'Hello <b>world!</b>'
# purge element contents (removing a child also drops its tail text)
content_tag.text = ''
for child in list(content_tag):
    content_tag.remove(child)

fragments = lxml.html.fragments_fromstring(somehtml)
# a leading string, if present, is the text before the first element
if fragments and isinstance(fragments[0], str):
    content_tag.text = fragments.pop(0)
content_tag.extend(fragments)
Marty answered Oct 15 '22 22:10