Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Relative URL to absolute URL Scrapy

Tags:

scrapy

I need help to convert relative URL to absolute URL in Scrapy spider.

I need to convert links on my start pages to absolute URL to get the images of the scrawled items, which are on the start pages. I unsuccessfully tried different ways to achieve this and I'm stuck. Any suggestion?

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1"
    ]

def parse(self, response):
    image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
    relative_url = response.xpath(u'''//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href''').extract()

    for image_url, url in zip(image_urls, absolute_urls):
        item = ExampleItem()
        item['image_urls'] = image_urls

    request = Request(url, callback=self.parse_dir_contents)
    request.meta['item'] = item
    yield request
like image 244
jacquesseite Avatar asked Mar 18 '16 13:03

jacquesseite


People also ask

Is absolute URL better than relative?

An absolute URL contains more information than a relative URL does. Relative URLs are more convenient because they are shorter and often more portable. However, you can use them only to reference links on the same server as the page that contains them.

What is absolute URL relative URL?

An absolute URL contains all the information necessary to locate a resource. A relative URL locates a resource using an absolute URL as a starting point. In effect, the "complete URL" of the target is specified by concatenating the absolute and relative URLs.

How do you make an absolute URL?

An absolute URL is the full URL, including protocol ( http / https ), the optional subdomain (e.g. www ), domain ( example.com ), and path (which includes the directory and slug).


Video Answer


1 Answers

There are mainly three ways to achieve that:

  1. Using urljoin function from urllib:

    from urllib.parse import urljoin
    # Same as: from w3lib.url import urljoin
    
    url = urljoin(base_url, relative_url)
    
  2. Using the response's urljoin wrapper method, as mentioned by Steve.

    url = response.urljoin(relative_url)
    
  3. If you also want to yield a request from that link, you can use the handful response's follow method:

    # It will create a new request using the above "urljoin" method
    yield response.follow(relative_url, callback=self.parse)
    
like image 159
Paulo Romeira Avatar answered Nov 02 '22 19:11

Paulo Romeira