How to crawl in a desired order, or synchronously, in Scrapy?

Tags:

python

scrapy

The problem

I'm trying to create a spider that crawls and scrapes every product from a store and outputs the results to a JSON file. That means going into each category on the main page and scraping every product (just name and price); each category page uses infinite scrolling.

My problem is that after I scrape the first page of a category, the next request returns items from the next category instead of the next batch of items from the same one, so the output ends up being a mess.

What I've already tried

I've already tried playing with the settings, forcing the number of concurrent requests to one, and setting a different priority for each request.

I've read that Scrapy crawls asynchronously, but I can't figure out how to make it send the requests in order.
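For reference, here is roughly what "forcing concurrent requests to one" looks like in settings.py (a sketch; serializing downloads this way does not by itself control which queued request gets scheduled next):

# settings.py -- serialize downloads to one request at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1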

Code

import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    #Scrapes links for every category from main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        prio = 20
        for category in categories:
            url = response.urljoin(category.extract())
            yield scrapy.Request(url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
            prio = prio - 1

    #Scrapes products from every page of each category      
    def parse_item_list(self, response, prio):

        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        #URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})

Output vs Expected

What it does: Category 1 page 1 > Cat 2 page 1 > Cat 3 page 1 > ...

What I want it to do: Cat 1 page 1 > Cat 1 page 2 > Cat 1 page 3 > ... > Cat 2 page 1

asked Sep 05 '19 by Bustencio


People also ask

Is Scrapy asynchronous?

Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it's implemented using non-blocking (aka asynchronous) code for concurrency.

What does Scrapy crawl do?

Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.
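For instance, here is a sketch of swapping BeautifulSoup in for Scrapy's selectors inside a callback (assumes the bs4 package is installed; the spider name and start URL are illustrative):

import scrapy
from bs4 import BeautifulSoup

class SoupSpider(scrapy.Spider):
    name = "soup_example"  # hypothetical name, for illustration only
    start_urls = ["https://www.pccomponentes.com/componentes"]

    def parse(self, response):
        # Parse the raw HTML with BeautifulSoup instead of Scrapy selectors
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a"):
            yield {"href": link.get("href")}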

What is callback in Scrapy?

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
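The truncated example here appears to be the one from the Scrapy Request documentation; restored, it looks like this:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # Called with the response downloaded for the request above
    self.log("Visited %s" % response.url)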

How do you scrape data from a website using Scrapy?

While working with Scrapy, you first need to create a Scrapy project. Then create one spider that fetches the data: move to the project's spiders folder and create a Python file there, e.g. gfgfetch.py.
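A minimal skeleton of such a spider might look like this (a sketch; the spider name comes from the FAQ's example, and the URL and selector are placeholders):

import scrapy

class GfgSpider(scrapy.Spider):
    name = "gfgfetch"                      # name used by "scrapy crawl"
    start_urls = ["https://example.com/"]  # placeholder start page

    def parse(self, response):
        # Extract data with selectors and yield items as plain dicts
        for title in response.css("h2::text").getall():
            yield {"title": title}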


1 Answer

This is easy.

Get the list of all categories into all_categories, but don't request all the links at once: request only the first category link, and once all pages of that category have been scraped, send a request for the next category link.

Here is the code. I did not run it, so there may be some syntax errors, but the logic is what you need:

import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    all_categories = []

    #Returns a request for the next unvisited category, or None when done
    def yield_category(self):
        if self.all_categories:
            url = self.all_categories.pop(0)  #pop(0) keeps the categories in page order
            self.logger.info("Scraping category %s", url)
            return scrapy.Request(url, self.parse_item_list)
        else:
            self.logger.info("All categories scraped")

    #Scrapes links for every category from main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')

        self.all_categories = [response.urljoin(category.extract()) for category in categories]
        yield self.yield_category()

    #Scrapes products from every page of each category
    def parse_item_list(self, response):

        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        #URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list)
        else:
            #No next page: this category is done, move on to the next one
            yield self.yield_category()
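To get the JSON output the question asks for, run the spider with a feed export (the file name is arbitrary):

scrapy crawl pccom -o products.json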
answered Sep 29 '22 by Umair Ayub