I'm crawling a website (only two levels deep), and I want to scrape information from pages on both levels. The problem I'm running into is that I want to fill out the fields of one item with information from both levels. How do I do this?

I was thinking of keeping a list of items as an instance variable accessible to all threads (since it's the same instance of the spider): parse_1 would fill out some fields, and parse_2 would have to check for the correct key before filling in the corresponding value. This method seems burdensome, and I'm still not sure how to make it work.

What I'm thinking is there must be a better way, maybe somehow passing an item to the callback. I don't know how to do that with the Request() method, though. Ideas?
Thanks to its built-in support for selecting and extracting data and for generating feed exports in multiple formats, Scrapy is generally faster than Beautiful Soup for crawling jobs like this; Beautiful Soup can be sped up with multithreading, but you have to arrange that yourself.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow, and other Request callbacks have the same requirement: this method, like any other Request callback, must return an iterable of Request and/or item objects.
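For example, a single parse callback can yield items and follow-up Requests from the same response. Here's a minimal sketch along the lines of the Scrapy tutorial's quotes.toscrape.com example (the site and selectors are just for illustration):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote on the page...
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # ...and also yield a Request to follow the pagination link.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)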
We use the CrawlerProcess class to run multiple Scrapy spiders in a single process simultaneously. You create an instance of CrawlerProcess with the project settings; if a spider needs custom settings of its own, you create a Crawler instance for that spider instead.
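A short sketch of that, assuming two hypothetical spiders (get_project_settings() falls back to default settings if you run this outside a Scrapy project):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield {"url": response.url}

class Spider2(scrapy.Spider):
    name = "spider2"
    start_urls = ["https://example.org/"]

    def parse(self, response):
        yield {"url": response.url}

# Schedule both spiders, then start the reactor;
# start() blocks until both crawls finish.
process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()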
We call the parse method to extract data from the pages, but to scrape them you need to understand the response's selector commands, CSS and XPath. Request: an object that represents a call for a page or for data. Response: the answer obtained for a Request.
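To illustrate, CSS and XPath are two equivalent ways to query the same response. A minimal, hypothetical spider (example.com really does serve a page with an <h1>, so this runs as-is):

import scrapy

class TitleSpider(scrapy.Spider):
    name = "title_demo"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # CSS selector: the text of the page's <h1>.
        title_css = response.css("h1::text").get()
        # Equivalent XPath query for the same element.
        title_xpath = response.xpath("//h1/text()").get()
        yield {"css": title_css, "xpath": title_xpath}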
From the Scrapy documentation:
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
from scrapy import Request

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Attach the half-filled item to the request so the second
    # callback can pick it up and finish populating it.
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    # Retrieve the item started in parse_page1 and add the
    # second-level data before returning it.
    item = response.meta['item']
    item['other_url'] = response.url
    return item
So, basically, you scrape the first page and store all of its information in the item, then send the whole item along with the request for the second-level URL, and you end up with all the information in one item.
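Side note: on Scrapy 1.7+, cb_kwargs is the recommended way to pass data to callbacks instead of Request.meta. The same example would look roughly like this (still using the hypothetical MyItem and URL from above):

from scrapy import Request

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Pass the item as a keyword argument to the next callback.
    return Request("http://www.example.com/some_page.html",
                   callback=self.parse_page2,
                   cb_kwargs={'item': item})

def parse_page2(self, response, item):
    item['other_url'] = response.url
    return item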