I ran into a somewhat complicated XPath problem. Consider this HTML of part of a web page (I used Imgur and replaced some text):
<a href="//i.imgur.com/ahreflink.jpg" class="zoom">
<img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
</img>
</a>
I first want to find all img tags in the document and their corresponding src values. Next, I want to check whether the img src link contains an image file extension (.jpeg, .jpg, .gif, .png). If it doesn't, don't grab it. In this case it does have an image extension. Now we need to decide which link to grab. Since the parent a tag has an href, we should grab that link.
Desired Result: //i.imgur.com/ahreflink.jpg
But now let's say the parent href doesn't exist:
<a name="missing! oh no!">
<img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg">
</img>
</a>
Desired Result: //i.imgur.com/imgsrclink.jpg
How do I go about constructing this XPath? If it helps, I am also using Python (Scrapy) with XPath. So if the problem needs to be separated out, Python can be used as well.
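This is simple to do in a single XPath expression: take the src of every img whose parent a has no href, and union it with the href of every a that contains an img: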
//a[not(@href)]/img/@src | //a[img]/@href
You don't have to do it in a single XPath expression. Here is a Scrapy-specific implementation that omits the image extension check (judging by the comments, you've already figured that out):
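images = response.xpath("//a/img")
for image in images:
    # href of the parent <a>; None when the attribute is missing
    a_link = image.xpath("../@href").extract_first()
    image_link = image.xpath("@src").extract_first()
    # prefer the parent's href, fall back to the image's own src
    print(a_link or image_link)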
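For the image extension check that the snippet above leaves out, one possible sketch in plain Python follows. The names IMAGE_EXTENSIONS, has_image_extension and pick_link are made up for illustration and are not part of Scrapy. It assumes the extension should be read from the path part of the img src, ignoring case and any query string.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Extensions listed in the question; the set name is hypothetical.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".gif", ".png"}


def has_image_extension(url):
    """Return True if the URL's path ends in one of the image extensions."""
    path = urlparse(url).path  # drops any query string or fragment
    return splitext(path)[1].lower() in IMAGE_EXTENSIONS


def pick_link(a_href, img_src):
    """Hypothetical helper: skip images without an image extension,
    otherwise prefer the parent's href and fall back to the img src."""
    if not img_src or not has_image_extension(img_src):
        return None
    return a_href or img_src
```

Inside the loop, print(a_link or image_link) could then become something like print(pick_link(a_link, image_link)), skipping the None results.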