Relative URL to absolute URL Scrapy

Tags:

scrapy

I need help converting relative URLs to absolute URLs in a Scrapy spider.

I need to convert the links on my start pages to absolute URLs so that I can get the images of the scraped items, which are on the start pages. I have tried several approaches without success and I'm stuck. Any suggestions?

import scrapy
from scrapy import Request

# ExampleItem is assumed to be defined elsewhere in the project (not shown).


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1"
    ]

    def parse(self, response):
        image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
        relative_url = response.xpath(u'''//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href''').extract()

        # absolute_urls is the missing piece: relative_url converted to absolute URLs
        for image_url, url in zip(image_urls, absolute_urls):
            item = ExampleItem()
            # one image per item, kept as a list
            item['image_urls'] = [image_url]

            # pass the item along to the detail-page callback
            request = Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request


asked Mar 18 '16 13:03

jacquesseite


1 Answer

There are three main ways to achieve that:

Using the urljoin function from urllib.parse:

from urllib.parse import urljoin
# Same as: from w3lib.url import urljoin

url = urljoin(base_url, relative_url)
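
To make the behaviour of `urljoin` concrete, here is a small sketch that uses only the standard library. The base URL is borrowed from the question's `start_urls`; the relative links are made up for illustration.

```python
from urllib.parse import urljoin

# Base URL borrowed from the question's start_urls.
base_url = "http://www.example.com/billboard?page=1"

# Hypothetical relative links, as they might appear in href/src attributes.
root_relative = urljoin(base_url, "/images/poster.jpg")
path_relative = urljoin(base_url, "item/42")
scheme_relative = urljoin(base_url, "//cdn.example.com/a.png")
already_absolute = urljoin(base_url, "https://other.example.org/x")

print(root_relative)     # http://www.example.com/images/poster.jpg
print(path_relative)     # http://www.example.com/item/42
print(scheme_relative)   # http://cdn.example.com/a.png
print(already_absolute)  # https://other.example.org/x
```

Note that the query string of the base URL is dropped, and a URL that is already absolute is returned unchanged, so it is safe to pass every extracted link through `urljoin`.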

Using the response's urljoin wrapper method, as mentioned by Steve:
```
url = response.urljoin(relative_url)
```

If you also want to yield a request from that link, you can use the response's handy follow method:

# It will create a new request using the above "urljoin" method
yield response.follow(relative_url, callback=self.parse)


answered Nov 02 '22 19:11

Paulo Romeira
