I'm trying to create a spider that crawls and scrapes every product from a store and outputs the results to a JSON file. That includes going into each category on the main page and scraping every product (just name and price); each category page uses infinite scrolling.
My problem is that each time I make a request after scraping the first page of a category, instead of getting the next batch of items from that same category, I get the items from the next category, and the output ends up being a mess.
I've already tried messing with the settings, forcing concurrent requests to one, and setting different priorities for each request (see the snippet below for what I mean).
I've found out about asynchronous crawling, but I can't figure out how to create the requests in order.
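What I tried in settings.py looks roughly like this; it serializes the downloads but still does not keep the requests in the order I want:

# settings.py -- roughly what I tried; it did not fix the ordering
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1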
import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    # Scrapes links for every category from the main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        prio = 20
        for category in categories:
            url = response.urljoin(category.extract())
            yield scrapy.Request(url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
            prio = prio - 1

    # Scrapes products from every page of each category
    def parse_item_list(self, response, prio):
        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        # URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
What it does: Category 1 page 1 > Cat 2 page 1 > Cat 3 page 1 > ...
What I want it to do: Cat 1 page 1 > Cat 1 page 2 > Cat 1 page 3 > ... > Cat 2 page 1
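For reference, the ScrapperPccomItem used above is presumably just a minimal item holding the two fields being scraped, something along these lines (the original items.py is not shown in the post):

# scrapper_pccom/items.py -- assumed definition, not part of the original post
import scrapy

class ScrapperPccomItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()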
This is easy:
Get the list of all categories into all_categories, but don't request every link at once. Scrape only the first category link, and once all pages of that category have been scraped, send a request for the next category link.
Here is the code. I did not run it, so there may be some syntax errors, but the logic is what you need:
import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    all_categories = []

    # Pops the next pending category (if any) and returns a request for it
    def yield_category(self):
        if self.all_categories:
            url = self.all_categories.pop()
            print("Scraping category %s " % (url))
            return scrapy.Request(url, self.parse_item_list)
        else:
            print("all done")

    # Scrapes links for every category from the main page,
    # then starts crawling a single category
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        self.all_categories = list(response.urljoin(category.extract()) for category in categories)
        yield self.yield_category()

    # Scrapes products from every page of each category
    def parse_item_list(self, response):
        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        # URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list)
        else:
            print("All pages of this category scraped, now scraping next category")
            yield self.yield_category()
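With this approach the spider finishes one category, following its next-page links, before moving on to the next one, so the output stays grouped by category. To get the results into a JSON file, as asked, you can rely on Scrapy's built-in feed exports when running the spider:

scrapy crawl pccom -o products.json

(Or, if you are on Scrapy 2.1+, configure the FEEDS setting in settings.py instead.)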