I use lxml to retrieve the attributes of tags from an HTML page. The HTML page is formatted like this:
<div class="my_div">
<a href="/foobar">
<img src="my_img.png">
</a>
</div>
The Python script I use to retrieve the URL inside the <a> tag and the src value of the <img> tag inside the same <div> is this:
from lxml import html
...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.xpath('/@href')
    src = element.xpath('//img/@src')
Why don't I get the strings?
You are using lxml, so you are working with lxml objects: HtmlElement instances. HtmlElement inherits from etree.Element (http://lxml.de/api/lxml.etree._Element-class.html), which has a get method that returns the value of an attribute. So the proper way to do this is:
from lxml import html
...
tree = html.fromstring(page.text)
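for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = link_element.get('href')
    # find() must be called on the element, not on the href string
    image_element = link_element.find('img')
    # Compare with None: an lxml element without children is falsy
    if image_element is not None:
        img_src = image_element.get('src')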
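As for why the original XPath calls did not return the expected strings: an expression that starts with / or // is evaluated from the document root, not from the current element, so '/@href' matches nothing and '//img/@src' collects every image on the page. Also, xpath() always returns a list, even for a single match. If you prefer to stay with XPath, here is a minimal sketch using relative expressions. The sample markup in PAGE and the helper name extract_links are made up for illustration and are not part of the original script.

```python
from lxml import html

# Hypothetical sample page, modelled on the markup in the question.
PAGE = """
<html><body>
<div class="my_div">
  <a href="/foobar"><img src="my_img.png"></a>
</div>
<div class="other">
  <a href="/ignored"><img src="other.png"></a>
</div>
</body></html>
"""


def extract_links(page_source):
    """Return (href, src) pairs for every <a> inside a "my_div" <div>."""
    tree = html.fromstring(page_source)
    pairs = []
    for link in tree.xpath('//div[contains(@class, "my_div")]//a'):
        # Relative XPath: "@href" reads the attribute of this <a> itself,
        # and ".//img/@src" searches only below this <a>.
        hrefs = link.xpath('@href')
        srcs = link.xpath('.//img/@src')
        # xpath() returns a list, so take the first match if there is one.
        href = hrefs[0] if hrefs else None
        src = srcs[0] if srcs else None
        pairs.append((href, src))
    return pairs


results = extract_links(PAGE)
print(results)
```

Both approaches give the same values here. get() is shorter when you already hold the element, while relative XPath is handy when the <img> may be nested deeper inside the <a>.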