
Extracting attributes from HTML with lxml

Tags:

python

html

lxml

I use lxml to retrieve tag attributes from an HTML page. The page is structured like this:

<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>

This is the Python script I use to retrieve the URL in the <a> tag and the src value of the <img> tag inside the same <div>:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.xpath('/@href')
    src = element.xpath('//img/@src')

Why don't I get the strings?

Ganjalf asked Nov 21 '14 20:11


1 Answer

You are using lxml, so you are working with lxml objects: HtmlElement instances. HtmlElement derives from etree's _Element class (http://lxml.de/api/lxml.etree._Element-class.html), which has a get method that returns the value of an attribute. So the proper way to do this is:

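from lxml import html

...
tree = html.fromstring(page.text)
for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    # get() returns the attribute value, or None if the attribute is absent
    href = link_element.get('href')
    # look for an <img> directly under this <a>
    image_element = link_element.find('img')
    # test against None: an lxml element with no children evaluates as False
    if image_element is not None:
        img_src = image_element.get('src')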
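
As for why the XPath calls in the question do not give you strings: xpath() returns a list for node and attribute selections, and a path that starts with / is evaluated from the document root even when it is called on an element. So '/@href' matches nothing, and '//img/@src' matches every <img> in the page rather than the one inside the current <a>. If you would rather stay with XPath, relative paths should work. A minimal sketch, assuming the markup from the question (PAGE and extract_links are names made up for this example, not part of lxml):

```python
from lxml import html

# Sample markup copied from the question; in the real script it comes from page.text.
PAGE = """
<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>
"""


def extract_links(markup):
    """Return (href, src) pairs for every <a> inside a my_div <div>."""
    tree = html.fromstring(markup)
    pairs = []
    for a in tree.xpath('//div[contains(@class, "my_div")]//a'):
        # Relative paths: "@href" is an attribute of this <a>,
        # ".//img/@src" searches only below this <a>.
        hrefs = a.xpath('@href')
        srcs = a.xpath('.//img/@src')
        # xpath() returns lists, so take the first match or fall back to None.
        pairs.append((hrefs[0] if hrefs else None,
                      srcs[0] if srcs else None))
    return pairs


print(extract_links(PAGE))
```

Both approaches read the same attributes; get() is shorter when you already hold the element, while the relative XPath also reaches an <img> nested deeper inside the <a>.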
Alex Pertsev answered Oct 06 '22 12:10