I am trying to use lxml to read HTML from a string, find all img tags, update each image's src attribute, and wrap each image in a hyperlink.
so this,
<img src="old-value" />
will be this
<a href=""><img src="new-value" /></a>
I am facing two problems. First, I am using etree.HTML to load the HTML string, which for some reason adds html and body tags to the markup. Is there a way to load it without this happening?
The second problem I cannot solve is how to add the hyperlink element around the img tag. I tried the code below, but it adds the hyperlink element inside the img tag:
from lxml import etree

tree = etree.HTML(self.content)
imgs = tree.xpath('.//img')
thm = "new-value"
for img in imgs:
    img.set('src', thm)
    a = etree.Element('a', href="#")
    img.insert(0, a)  # this puts <a> inside <img>, which is not what I want
Can anyone advise, please?
update:
I just tried the approach provided by @Alko and it works well, but it has a problem with the type of content I am using.
The img tag is located inside p tags, as in the example below:
<html><body><p><img src="/public_media/cache/66/ed/66edd1c01e3027ba18bef9244ca8e8b4.jpg?id=31"/>jshjksh skjhs jksh skjhsj ksh jkshs kjhs kjsh sjkhs khs ksh skh skh skjh skjh skjh ksjh ksh skhs kjsh skjh skhs khs kjsh skjh skjhs kshk sjh skjhs kjsh skjh skjh ksj ksjh jsk hskjh s</p><p>jshjksh skjhs jksh skjhsj ksh jkshs kjhs kjsh sjkhs khs ksh skh skh skjh
skjh skjh ksjh ksh skhs kjsh skjh skhs khs kjsh skjh skjhs kshk sjh
skjhs kjsh skjh skjh ksj ksjh jsk hskjh s</p></body></html>
What happens when I run the given solution is that the closing a tag ends up at the end of the paragraph instead of right after the image.
It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. Which parser works better depends very much on the input. They also note: "The downside of using this parser is that it is much slower than the HTML parser of lxml."
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
The lxml module is an XML toolkit that is essentially a Pythonic binding for two C libraries, libxslt and libxml2. It stands out for combining rich XML features with speed.
You can use addprevious before insert:
imgs = tree.xpath('.//img')
thm = "new-value"
for img in imgs:
    img.set('src', thm)
    a = etree.Element('a', href="#")
    img.addprevious(a)  # place <a> as a sibling just before <img>
    a.insert(0, img)    # then move <img> inside <a>
That will result in
>>> etree.tostring(tree)
'<html><body><a href="#"><img src="new-value"/></a></body></html>'
Also, lxml.html.fragment_fromstring can be useful, but you would have to provide a more diverse example: in your case of a lone image element, the img is the root itself, so it won't be found by your xpath.
See following demo:
>>> import lxml.html
>>> from lxml import etree
>>> img = lxml.html.fragment_fromstring('<img src="old-value" />')
>>> thm = "new-value"
>>> img.set('src', thm)
>>> a = etree.Element('a', href="#")
>>> a.insert(0, img)
>>> lxml.html.etree.tostring(a)
'<a href="#"><img src="new-value"/></a>'
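For content with more than one top-level node, a rough sketch along the same lines is shown here. It assumes that wrapping the fragment in a container is acceptable: the create_parent argument of fragment_fromstring puts everything inside a single div, so no html or body tags are added and the xpath can find the images. The snippet string and the variable names are made up for illustration.

```python
import lxml.html
from lxml import etree

# Hypothetical input: two sibling paragraphs and no <html>/<body> wrapper.
snippet = '<p><img src="old-value"/>some text</p><p>more text</p>'

# create_parent='div' gives the fragment a single <div> root element.
root = lxml.html.fragment_fromstring(snippet, create_parent='div')

for img in root.xpath('.//img'):
    img.set('src', 'new-value')
    a = etree.Element('a', href='#')
    img.addprevious(a)   # <a> becomes a sibling just before <img>
    a.tail = img.tail    # keep the text that followed <img> outside the link
    img.tail = None
    a.append(img)        # move <img> inside <a>

result = lxml.html.tostring(root, encoding='unicode')
print(result)
```

If the extra div is unwanted, its children can be serialized one by one instead of serializing root.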
Update
For the case when the img tag has a tail, you can reassign it to the created a tag:
>>> s = '<html><body><p><img src="old_value"/>some text</p></body></html>'
>>> tree = etree.HTML(s)
>>> imgs = tree.xpath('.//img')
>>> thm = "new-value"
>>> for img in imgs:
...     img.set('src', thm)
...     a = etree.Element('a', href="#")
...     img.addprevious(a)
...     a.insert(0, img)
...     a.tail = img.tail
...     img.tail = ''
...
>>> etree.tostring(tree)
'<html><body><p><a href="#"><img src="new-value"/></a>some text</p></body></html>'
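To see why the tail has to be reassigned, here is a small sketch comparing both variants. lxml stores the text that follows an element in that element's tail attribute, so moving img into a new parent carries the text along with it. The helper name wrap_first_img and the sample markup are made up for illustration.

```python
from lxml import etree

def wrap_first_img(html, fix_tail):
    """Hypothetical helper: wrap the first <img> in <a>, optionally fixing the tail."""
    tree = etree.HTML(html)
    img = tree.xpath('.//img')[0]
    a = etree.Element('a', href='#')
    img.addprevious(a)
    a.append(img)  # img.tail travels with img into <a>
    if fix_tail:
        a.tail = img.tail  # hand the trailing text over to <a>
        img.tail = None
    return etree.tostring(tree, encoding='unicode')

sample = '<p><img src="x"/>some text</p>'

# The text after <img> lives on the img element itself.
img_tail = etree.HTML(sample).xpath('.//img')[0].tail

without_fix = wrap_first_img(sample, fix_tail=False)
with_fix = wrap_first_img(sample, fix_tail=True)

print(repr(img_tail))
print(without_fix)
print(with_fix)
```

Without the fix, the text ends up inside the link, which is the behaviour described in the question's update.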