 

Python Scrapy parse() function: where is the return value returned to?

I am new to Scrapy, and I am sorry if this question is trivial. I have read the official Scrapy documentation, and while looking through it, I came across this example:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

I know that the parse method must return an item and/or a request, but where are these return values returned to?

One is an item and the other is a request, and I think these two types are handled differently. In the case of CrawlSpider, there is a Rule with a callback. What about that callback's return value? Where does it go? Is it handled the same way as parse()'s?

I am very confused about the Scrapy workflow, even after reading the documentation.

SangminKim asked Oct 04 '14

People also ask

What is Scrapy Response?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
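For example, here is a minimal sketch of that round trip (the site and XPaths are placeholders, not from the question):

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each yielded Request travels through the Scheduler and Downloader;
        # the Response it produces is delivered to the callback below.
        for href in response.xpath('//h3/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # This Response corresponds to the Request yielded in parse().
        yield {'title': response.xpath('//h1/text()').extract_first()}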

What is yield in Scrapy?

You can save the data returned from a Scrapy spider using regular methods such as printing, logging, or ordinary file handling. However, Scrapy offers a built-in way of saving and storing data through the yield keyword.
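As a rough illustration (the site and field names are assumptions), yielding plain dicts lets Scrapy's feed exports store the data without any manual file handling:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # Each yielded dict is collected by the engine and can be
            # exported with: scrapy crawl quotes -o quotes.json
            yield {
                'text': quote.xpath('.//span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small/text()').extract_first(),
            }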

What is Start_urls in Scrapy?

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define Rules for it.
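For instance, a minimal CrawlSpider sketch (the domain and XPaths are placeholders); whatever the Rule callback yields is handled exactly like the return values of parse():

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow every extracted link and hand each downloaded page
        # to parse_item (CrawlSpider reserves parse for itself).
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yielded items go to the item pipelines, just as with parse().
        yield {'title': response.xpath('//h3/text()').extract_first()}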


1 Answer

According to the documentation:

The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

In other words, returned/yielded items and requests are handled differently: items are handed to the item pipelines and item exporters, while requests are put into the Scheduler, which feeds them to the Downloader to make the actual request and produce a response. The engine then receives the response and gives it to the spider for processing (to the callback method).
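To make the item side concrete, here is a minimal pipeline sketch (the class and file names are hypothetical); every item yielded from a spider callback is passed through process_item once the pipeline is enabled:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

You would enable it with something like ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300} in settings.py.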

The whole data-flow process is described in the Architecture Overview page in a very detailed manner.

Hope that helps.

alecxe answered Sep 22 '22