 

Scrapy crawl nested urls

Introduction

As I have to go deeper into crawling, I face my next problem: crawling nested pages like https://www.karton.eu/Faltkartons

My crawler has to start at this page, go to https://www.karton.eu/Einwellige-Kartonagen and visit every product listed in that category.

It should do the same for every subcategory of "Faltkartons", for every single product contained in every category.

EDITED

My code now looks like this:

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allow_domains = ['karton.eu']
    start_urls = [
        'https://www.karton.eu/Faltkartons'
        ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume'] } 
    
    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center articelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items ['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items ['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items ['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items ['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items ['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(),callback=self.parse_item, meta={'items':items})

    def parse_item(self,response):
        table = response.xpath('//div[@class="product-info-inner"]')

        items = KartonageItem()
        items = response.meta['items']
        items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items

In my head it starts at the start_url, then visits https://www.karton.eu/Einwellige-Kartonagen, looks for links and follows them to https://www.karton.eu/einwellig-ab-100-mm. On that page it checks the cards for some information and follows the link to the specific product page to get the last items.

Which part(s) of my method is/are wrong? Should I change my class from "scrapy.Spider" to "CrawlSpider"? Or is that only needed if I want to set some rules?

It could still be possible that my xpaths for the title, SKU etc. are wrong, but first of all I just want to build the basics to crawl these nested pages.

My console output:

(screenshot of console output)

Finally I managed to go through all these pages, but somehow my .csv file is still empty.

asked Feb 13 '26 by kekw


1 Answer

According to the comments you provided, the issue starts with you skipping a request in your chain.

Your start_urls will request this page: https://www.karton.eu/Faltkartons. The page will be parsed by the parse method and yield new requests, from https://www.karton.eu/Karton-weiss to https://www.karton.eu/Einwellige-Kartonagen.

Those pages will be parsed in the parse_item method, but they are not the final pages you want. You need a parsing step in between that follows the cards and yields new requests, like this:

for url in response.xpath('//div[@class="cat-thumbnails"]/div/a/@href'):
    yield scrapy.Request(response.urljoin(url.get()), callback=self.new_parsing_method)

For example, parsing https://www.karton.eu/Zweiwellige-Kartons will find 9 new links, from

  • https://www.karton.eu/zweiwellig-ab-100-mm to...

  • https://www.karton.eu/zweiwellig-ab-1000-mm

Finally you need a parsing method to scrape the items on those pages. Since there is more than one item per page, I suggest you iterate over them in a for loop. (You need the proper xpaths to scrape the data.)

EDIT:

Re-editing, as I have now observed the page structure and saw that my code was based on a wrong assumption. The thing is that some pages don't have a subcategory page, while others do.

Page structure:

ROOT: www.karton.eu/Faltkartons
 |_ Einwellige Kartons
    |_ Subcategory: Kartons ab 100 mm Länge
      |_ Item List (www.karton.eu/einwellig-ab-100-mm)
        |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
    ...
    |_ Subcategory: Kartons ab 1000 mm Länge
      |_ ...
 |_ Zweiwellige Kartons #Same as above
 |_ Lange Kartons #Same as above
 |_ quadratische Kartons #There is no subcategory
    |_ Item List (www.karton.eu/quadratische-Kartons)
      |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
 |_ Kartons Höhenvariabel #There is no subcategory
 |_ Kartons weiß #There is no subcategory

The code below will scrape items from the pages with subcategories, as I think that's what you want. Either way, I left a print statement to show you the pages that will be skipped because they have no subcategory page, in case you want to include them later.

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allow_domains = ['karton.eu']
    start_urls = [
        'https://www.karton.eu/Faltkartons'
        ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume'] } 
    
    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        if not url2:
            print('Empty url2:', response.url)

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center artikelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items ['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items ['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items ['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items ['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items ['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(),callback=self.parse_item, meta={'items':items})

    def parse_item(self,response):
        table = response.xpath('//div[@class="product-info-inner"]')

        #items = KartonageItem() # You don't need this here, as the line below overwrites the variable.
        items = response.meta['items']
        items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items

Notes

Changed this:

    card = response.xpath('//div[@class="text-center articelbox"]')

to this: (K instead of C)

    card = response.xpath('//div[@class="text-center artikelbox"]')

Commented this out, as the items in meta is already a KartonageItem. (You can remove it.)

def parse_item(self,response):
    table = response.xpath('//div[@class="product-info-inner"]')
    #items = KartonageItem()
    items = response.meta['items']

Changed this in the parse_item method:

    items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
    items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()

To this:

    items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
    items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()

Since the variable a doesn't exist in that method.

answered Feb 14 '26 by renatodvc