Assume I have a scraped item that looks like this:
{
    "name": "Foo",
    "country": "US",
    "url": "http://..."
}
In a pipeline I want to make a GET request to the url and check some response headers, such as the content type and status. When the headers do not meet certain conditions, I want to drop the item. Something like:
class MyPipeline(object):
    def process_item(self, item, spider):
        # pseudocode: a hypothetical async request API with callbacks
        def on_response(response):
            if ...:
                raise DropItem()
            return item

        def on_error(error):
            raise DropItem()

        request(item['url'], on_response, on_error)
It smells like this is not possible using pipelines. What do you think? Any ideas on how to achieve this?
The spider:
import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        json_response = json.loads(response.body_as_unicode())
        for station in json_response:
            yield station
Easy way
import requests
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    response = requests.get(item['url'])
    if response.status_code ...:
        raise DropItem()
    elif response.text ...:
        raise DropItem()
    else:
        return item
Scrapy way
Now, I think you shouldn't do this inside a pipeline; you should handle it inside the spider by yielding a request instead of the item, and then yielding the item from that request's callback.
Now if you still want to include a scrapy Request inside a pipeline you could do something like this:
from scrapy import Request
from scrapy.exceptions import DropItem

class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        ...
        self.crawler.engine.crawl(
            Request(
                url='someurl',
                meta={'item': item},  # pass the item along so the callback can re-send it
                callback=self.custom_callback,
            ),
            spider,
        )
        # you have to drop the item, and send it again after your check
        raise DropItem()

    # YES, you can define a method callback inside the same pipeline
    def custom_callback(self, response):
        ...
        yield response.meta['item']
Note that we are emulating the same behaviour as spider callbacks inside the pipeline. You need to figure out a way to always drop the items when you want to do an extra request, and only pass through the ones that are sent by the extra callback.
One way could be sending different types of items, and checking them inside the process_item of the pipeline:
def process_item(self, item, spider):
    if isinstance(item, TempItem):
        ...
    elif isinstance(item, FinalItem):
        return item