XPath select image links - parent href link of img src only if it exists, else select img src link

Question

I ran into a somewhat complicated XPath problem. Consider this HTML of part of a web page (I used Imgur and replaced some text):

<a href="//i.imgur.com/ahreflink.jpg" class="zoom">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
    </img>
</a>

I first want to find all img tags in the document and their corresponding src values. Next, I want to check whether the img src link contains an image file extension (.jpeg, .jpg, .gif, .png). If it doesn't, don't grab it. In this case it does have an image extension. Now we need to figure out which link to grab. Since the parent href exists, we should grab that link.

Desired Result: //i.imgur.com/ahreflink.jpg

But now let's say the parent href doesn't exist:

<a name="missing! oh no!">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
    </img>
</a>

Desired Result: //i.imgur.com/imgsrclink.jpg

How do I go about constructing this XPath? If it helps, I am also using Python (Scrapy) with XPath. So if the problem needs to be separated out, Python can be used as well.

o11c · Accepted Answer

This is very simple to do in a single XPath expression:

//a[not(@href)]/img/@src | //a[img]/@href
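
To see what this expression returns, here is a minimal sketch that runs it over the two snippets from the question. It assumes lxml is installed and uses it directly instead of a Scrapy response so that it runs standalone. The surrounding html/body wrapper and the names PAGE and EXPR are made up for the demonstration. Note that the expression itself does not perform the image-extension check.

```python
from lxml import etree

# Hypothetical test page: the two snippets from the question in one document.
PAGE = """
<html><body>
<a href="//i.imgur.com/ahreflink.jpg" class="zoom">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
</a>
<a name="missing! oh no!">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
</a>
</body></html>
"""

# The expression from the answer: img src under an a without href,
# plus the href of every a that has an img child.
EXPR = "//a[not(@href)]/img/@src | //a[img]/@href"

root = etree.HTML(PAGE)
# Attribute results come back as string-like objects; convert to plain str.
links = [str(value) for value in root.xpath(EXPR)]
print(links)
```

The first a contributes its href and the second contributes the img src, which matches the two desired results in the question.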

alecxe · Answer

You don't have to do it in a single XPath expression. Here is a Scrapy-specific implementation that omits the image extension check (judging by the comments, you've already figured that out):

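# Select every img that is a direct child of an a element.
images = response.xpath("//a/img")
for image in images:
    # href of the parent a; None when the attribute is missing.
    a_link = image.xpath("../@href").extract_first()
    image_link = image.xpath("@src").extract_first()

    # Prefer the parent's href and fall back to the img src.
    print(a_link or image_link)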
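
For completeness, here is a rough sketch of the same loop with the extension check from the question added. It is not Scrapy code: it assumes the standard library's xml.etree.ElementTree as a stand-in so it runs without a crawl, which only works because the sample markup happens to be well-formed XML. The names has_image_extension, pick_links and SAMPLE, the wrapping div, and the third anchor (an image without an extension) are all made up for illustration. It also assumes "contains an extension" means the URL path ends with one.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".gif", ".png")


def has_image_extension(url):
    """Return True if the URL path ends with a known image extension."""
    return urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)


def pick_links(markup):
    """For each a/img pair, return the parent href if present, else the img src."""
    root = ET.fromstring(markup)
    links = []
    for anchor in root.iter("a"):
        for image in anchor.findall("img"):
            src = image.get("src")
            if not src or not has_image_extension(src):
                continue  # skip images whose src has no image extension
            links.append(anchor.get("href") or src)
    return links


# Hypothetical sample: the two snippets from the question plus one
# image without an extension, wrapped in a single root element.
SAMPLE = """<div>
<a href="//i.imgur.com/ahreflink.jpg" class="zoom">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg"></img>
</a>
<a name="missing! oh no!">
    <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg"></img>
</a>
<a href="/profile">
    <img src="//i.imgur.com/avatar"></img>
</a>
</div>"""

print(pick_links(SAMPLE))
```

In a Scrapy spider, the same has_image_extension test could be applied to image_link inside the loop above before printing.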

Tags:

python

web-scraping

xpath

scrapy

dtgee
