lxml (Python): load an HTML string without added html and body tags, and wrap targeted elements in a new element

Tags:

python

lxml

I am trying to use lxml to read HTML from a string, find all img tags, update each image's src attribute, and wrap each image in a hyperlink.

So this:

<img src="old-value" />

should become this:

<a href=""><img src="new-value" /></a>

I am facing two problems. First, I am using etree.HTML to load the HTML string, and for some reason it adds html and body tags around the content. Is there a way to load the string without this happening automatically?

The second problem I have not been able to solve: how do I add the hyperlink element around the image tag? I tried the code below, but it adds the hyperlink element inside the img tag.

tree = etree.HTML(self.content)
imgs = tree.xpath('.//img')
thm = "new-value"
for img in imgs:
    img.set('src', thm)
    a = etree.Element('a', href="#")
    img.insert(0, a)

Can anyone advise, please?

Update:

I just tried the approach provided by @Alko and it works well, but it has a problem with the type of content I am using.

The img tag is located inside p tags, as in the example below:

<html><body><p><img src="/public_media/cache/66/ed/66edd1c01e3027ba18bef9244ca8e8b4.jpg?id=31"/>jshjksh skjhs jksh skjhsj ksh jkshs kjhs kjsh sjkhs khs ksh skh skh skjh skjh skjh ksjh ksh skhs kjsh skjh skhs khs kjsh skjh skjhs kshk sjh skjhs kjsh skjh skjh ksj ksjh jsk hskjh s</p><p>jshjksh skjhs jksh skjhsj ksh jkshs kjhs kjsh sjkhs khs ksh skh skh skjh&#13;
 skjh skjh ksjh ksh skhs kjsh skjh skhs khs kjsh skjh skjhs kshk sjh &#13;
skjhs kjsh skjh skjh ksj ksjh jsk hskjh s</p></body></html>

What happens when I run the given solution is that the closing a tag is added after the end of the paragraph.

asked Dec 17 '13 15:12

Mo J. Mughrabi

1 Answer

You can call addprevious before the insert:

imgs = tree.xpath('.//img')
thm = "new-value"
for img in imgs:
    img.set('src', thm)
    a = etree.Element('a', href="#")
    # Put the new <a> where the <img> currently sits ...
    img.addprevious(a)
    # ... then move the <img> inside it.
    a.insert(0, img)

That will result in

>>> etree.tostring(tree)
'<html><body><a href="#"><img src="new-value"/></a></body></html>'

Also, lxml.html.fragment_fromstring can be useful, but you would need to provide a more varied example: in your case of a lone image element, the img is itself the root of the fragment, so your XPath (.//img, which only matches descendants) will not find it.

See the following demo:

>>> import lxml.html
>>> from lxml import etree
>>> img = lxml.html.fragment_fromstring('<img src="old-value" />')
>>> thm = "new-value"
>>> img.set('src', thm)
>>> a = etree.Element('a', href="#")
>>> a.insert(0, img)
>>> lxml.html.etree.tostring(a)
'<a href="#"><img src="new-value"/></a>'
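For content with more than one top-level element, the following is a minimal sketch of one way to avoid the html and body wrappers. It assumes the snippet starts with an element rather than bare text, and the sample string and variable names are made up for illustration. Passing create_parent='div' to fragment_fromstring collects the parsed nodes under a div, and serializing only the children leaves that wrapper out of the output.

```python
import lxml.html

# Illustrative input; not taken from the question.
snippet = '<p><img src="old-value" />some text</p><p>more text</p>'

# create_parent='div' collects the parsed nodes under a <div>
# instead of building an <html><body> document around them.
wrapper = lxml.html.fragment_fromstring(snippet, create_parent='div')

# Here the images are descendants of the wrapper, so they can be found.
for img in wrapper.iter('img'):
    img.set('src', 'new-value')

# Serialize only the children so the <div> wrapper is not emitted.
result = ''.join(
    lxml.html.tostring(child, encoding='unicode') for child in wrapper
)
print(result)
```

Note that lxml.html.tostring writes HTML-style output, so the img tag is not self-closed as it is with etree.tostring.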

Update

If the img tag has a tail (text that directly follows it), you can reassign that tail to the newly created a tag:

>>> from lxml import etree
>>> s = '<html><body><p><img src="old_value"/>some text</p></body></html>'
>>> tree = etree.HTML(s)
>>> imgs = tree.xpath('.//img')
>>> thm = "new-value"
>>> for img in imgs:
...     img.set('src', thm)
...     a = etree.Element('a', href="#")
...     img.addprevious(a)
...     a.insert(0, img)
...     a.tail = img.tail
...     img.tail = ''
...
>>> etree.tostring(tree)
'<html><body><p><a href="#"><img src="new-value"/></a>some text</p></body></html>'
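The same steps can be collected into a small reusable function. This is only a sketch: the name wrap_images and its parameters are hypothetical, and the sample HTML is made up. It relies on the fact that lxml moves an element together with its tail text, which is why the tail is handed over to the a element after the move.

```python
from lxml import etree


def wrap_images(root, new_src, href="#"):
    """Wrap every <img> under root in an <a>, keeping trailing text outside the link."""
    # list() takes a snapshot so that moving nodes does not disturb iteration.
    for img in list(root.iter('img')):
        img.set('src', new_src)
        a = etree.Element('a', href=href)
        img.addprevious(a)   # place <a> where the <img> is
        a.insert(0, img)     # move <img> (and its tail) inside <a>
        a.tail = img.tail    # the trailing text now follows </a>
        img.tail = None      # and no longer sits inside the link
    return root


# Illustrative input with an image inside a paragraph.
tree = etree.HTML('<p><img src="old-value"/>some text</p><p>more text</p>')
wrap_images(tree, 'new-value')
output = etree.tostring(tree, encoding='unicode')
print(output)
```

Setting the tail to None rather than an empty string is a small stylistic choice here; both serialize the same way.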

answered Sep 28 '22 12:09

alko
