
XPath select image links - parent href link of img src only if it exists, else select img src link

I ran into a somewhat complicated XPath problem. Consider this HTML of part of a web page (I used Imgur and replaced some text):

<a href="//i.imgur.com/ahreflink.jpg" class="zoom">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
    </img>
</a>

I first want to find all img tags in the document and their corresponding src values. Next, I want to check whether the img src link contains an image file extension (.jpeg, .jpg, .gif, .png). If it doesn't, the image should be skipped. In this case it does, so the next step is to decide which link to grab. Since the parent href exists, we should grab that link.

Desired Result: //i.imgur.com/ahreflink.jpg

But now let's say the parent href doesn't exist:

<a name="missing! oh no!">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
    </img>
</a>

Desired Result: //i.imgur.com/imgsrclink.jpg

How do I go about constructing this XPath? If it helps, I am also using Python (Scrapy) with XPath. So if the problem needs to be separated out, Python can be used as well.

dtgee asked Jun 24 '16 03:06


2 Answers

This is very simple to do in a single XPath expression:

//a[not(@href)]/img/@src | //a[img]/@href
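
The expression above does not include the image-extension check from the question. Here is a minimal sketch of one way to add it, assuming XPath 1.0, which has no ends-with() function, so contains() is used as an approximation (it is case-sensitive and would also match an extension that appears in the middle of a URL). The helper name build_image_xpath and the IMAGE_EXTENSIONS tuple are illustrative and not part of the original answer:

```python
# Hypothetical helper: builds an XPath 1.0 expression string.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".gif", ".png")


def build_image_xpath(extensions=IMAGE_EXTENSIONS):
    """Return the union expression with an extension predicate on img/@src."""
    # contains() is case-sensitive and matches anywhere in the attribute value.
    has_ext = " or ".join(f"contains(@src, '{ext}')" for ext in extensions)
    return (
        # src of images whose parent <a> has no href ...
        f"//a[not(@href)]/img[{has_ext}]/@src"
        # ... or the href of any <a> that wraps a matching image.
        f" | //a[@href][img[{has_ext}]]/@href"
    )


expr = build_image_xpath()
print(expr)
# In a Scrapy callback this could be used as: response.xpath(expr).extract()
```

Like the original expression, this only considers img elements that are direct children of an a element.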
o11c answered Oct 31 '22 05:10


You don't have to do it in a single XPath expression. Here is a Scrapy-specific implementation that omits the image extension check (judging by the comments, you've already figured that out):

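# Select every img that is a direct child of an a element.
images = response.xpath("//a/img")
for image in images:
    # extract_first() returns None when the attribute is missing.
    a_link = image.xpath("../@href").extract_first()
    image_link = image.xpath("@src").extract_first()

    # Prefer the parent href; fall back to the img src.
    print(a_link or image_link)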
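
The extension check omitted above could be done in plain Python. The following is a rough sketch using only the standard library; the names has_image_extension and choose_link are hypothetical helpers, not part of Scrapy, and the check assumes the extension is the suffix of the URL path (query strings are ignored, case is not significant):

```python
import posixpath
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".gif", ".png"}


def has_image_extension(url):
    """True if the URL path ends in one of the known image extensions."""
    path = urlparse(url).path  # drops scheme, host, query and fragment
    return posixpath.splitext(path)[1].lower() in IMAGE_EXTENSIONS


def choose_link(a_href, img_src):
    """Return the parent href if present, else the img src.

    Returns None when the img src is missing or is not an image link.
    """
    if not img_src or not has_image_extension(img_src):
        return None
    return a_href or img_src


print(choose_link("//i.imgur.com/ahreflink.jpg", "//i.imgur.com/imgsrclink.jpg"))
print(choose_link(None, "//i.imgur.com/imgsrclink.jpg"))
```

Inside the loop, the print call could then be replaced by something like link = choose_link(a_link, image_link), skipping the image when link is None.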
alecxe answered Oct 31 '22 03:10