I'm trying to create a spider that crawls and scrapes every product from a store and outputs the results to a JSON file. That includes going into each category on the main page and scraping every product (just name and price); each category page uses infinite scrolling.
My problem is that each time I make a request after scraping the first page of a category, instead of getting the next batch of items from that same category, I get the items from the next category, and the output ends up being a mess.
I've already tried messing with the settings, forcing concurrent requests to one, and setting different priorities for each request (see the snippet below for what I mean).
I've found out about asynchronous crawling, but I can't figure out how to create the requests in order.
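What I tried in settings.py looks roughly like this; it serializes the downloads but still does not keep the requests in the order I want:

# settings.py -- roughly what I tried; it did not fix the ordering
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1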
import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    # Scrapes links for every category from the main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        prio = 20
        for category in categories:
            url = response.urljoin(category.extract())
            yield scrapy.Request(url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
            prio = prio - 1

    # Scrapes products from every page of each category
    def parse_item_list(self, response, prio):
        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        # URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
What it does: Category 1 page 1 > Cat 2 page 1 > Cat 3 page 1 > ...
What I want it to do: Cat 1 page 1 > Cat 1 page 2 > Cat 1 page 3 > ... > Cat 2 page 1
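For reference, the ScrapperPccomItem used above is presumably just a minimal item holding the two fields being scraped, something along these lines (the original items.py is not shown in the post):

# scrapper_pccom/items.py -- assumed definition, not part of the original post
import scrapy

class ScrapperPccomItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()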
This is easy:
Get the list of all categories into all_categories, but don't request every link at once. Scrape only the first category link, and once all pages of that category have been scraped, send a request for the next category link.
Here is the code. I did not run it, so there may be some syntax errors, but the logic is what you need:
import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    all_categories = []

    # Pops the next pending category (if any) and returns a request for it
    def yield_category(self):
        if self.all_categories:
            url = self.all_categories.pop()
            print("Scraping category %s " % (url))
            return scrapy.Request(url, self.parse_item_list)
        else:
            print("all done")

    # Scrapes links for every category from the main page,
    # then starts crawling a single category
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        self.all_categories = list(response.urljoin(category.extract()) for category in categories)
        yield self.yield_category()

    # Scrapes products from every page of each category
    def parse_item_list(self, response):
        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        # URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list)
        else:
            print("All pages of this category scraped, now scraping next category")
            yield self.yield_category()
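With this approach the spider finishes one category, following its next-page links, before moving on to the next one, so the output stays grouped by category. To get the results into a JSON file, as asked, you can rely on Scrapy's built-in feed exports when running the spider:

scrapy crawl pccom -o products.json

(Or, if you are on Scrapy 2.1+, configure the FEEDS setting in settings.py instead.)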