I need help to convert relative URL to absolute URL in Scrapy spider.
I need to convert links on my start pages to absolute URL to get the images of the scrawled items, which are on the start pages. I unsuccessfully tried different ways to achieve this and I'm stuck. Any suggestion?
class ExampleSpider(scrapy.Spider):
name = "example"
allowed_domains = ["example.com"]
start_urls = [
"http://www.example.com/billboard",
"http://www.example.com/billboard?page=1"
]
def parse(self, response):
image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
relative_url = response.xpath(u'''//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href''').extract()
for image_url, url in zip(image_urls, absolute_urls):
item = ExampleItem()
item['image_urls'] = image_urls
request = Request(url, callback=self.parse_dir_contents)
request.meta['item'] = item
yield request
An absolute URL contains more information than a relative URL does. Relative URLs are more convenient because they are shorter and often more portable. However, you can use them only to reference links on the same server as the page that contains them.
An absolute URL contains all the information necessary to locate a resource. A relative URL locates a resource using an absolute URL as a starting point. In effect, the "complete URL" of the target is specified by concatenating the absolute and relative URLs.
An absolute URL is the full URL, including protocol ( http / https ), the optional subdomain (e.g. www ), domain ( example.com ), and path (which includes the directory and slug).
There are mainly three ways to achieve that:
Using urljoin
function from urllib
:
from urllib.parse import urljoin
# Same as: from w3lib.url import urljoin
url = urljoin(base_url, relative_url)
Using the response's urljoin
wrapper method, as mentioned by Steve.
url = response.urljoin(relative_url)
If you also want to yield a request from that link, you can use the handful response's follow
method:
# It will create a new request using the above "urljoin" method
yield response.follow(relative_url, callback=self.parse)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With