Scrapy, make http request in pipeline

Assume I have a scraped item that looks like this:

{
    name: "Foo",
    country: "US",
    url: "http://..."
}

In a pipeline I want to make a GET request to the url and check the response, for example the Content-Type header and the status code. When the response does not meet certain conditions I want to drop the item, something like:

class MyPipeline(object):
    def process_item(self, item, spider):
        request(item['url'], function(response) {
           if (...) {
             raise DropItem()
           }
           return item
        }, function(error){ 
            raise DropItem()
        })

Smells like this is not possible using pipelines. What do you think? Any ideas how to achieve this?

The spider:

import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        jsonResponse = json.loads(response.body_as_unicode())
        for station in jsonResponse:
            yield station
asked Jul 19 '16 by UpCat

1 Answer

Easy way

import requests
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    # blocking GET request to the item's url
    response = requests.get(item['url'])
    if response.status_code ...:
        raise DropItem()
    elif response.text ...:
        raise DropItem()
    else:
        return item
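
Keep in mind that requests.get is blocking, so every item will stall the crawl while the HTTP call runs. As a rough sketch, assuming you only care about the status code and the Content-Type header (the pipeline name and both conditions are just placeholders for your own checks), it could look like this:

import requests
from scrapy.exceptions import DropItem

class HeaderCheckPipeline(object):
    def process_item(self, item, spider):
        try:
            # a HEAD request is enough when you only need status and headers
            response = requests.head(item['url'], allow_redirects=True, timeout=10)
        except requests.RequestException:
            raise DropItem('request to %s failed' % item['url'])

        if response.status_code != 200:
            raise DropItem('bad status %s for %s' % (response.status_code, item['url']))
        if 'text/html' not in response.headers.get('Content-Type', ''):
            raise DropItem('unexpected Content-Type for %s' % item['url'])
        return item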

Scrapy way

Now, I don't think you should do this inside a pipeline; it is better handled inside the spider itself, by yielding a request instead of the item and then yielding the item from that request's callback (see the sketch below).
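
A minimal sketch of that approach, building on the spider from the question and assuming every station dict has a url key and that a 200 status with an HTML Content-Type is what you want (swap in your own conditions), could look like this:

import json

import scrapy
from scrapy import Request

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        jsonResponse = json.loads(response.body_as_unicode())
        for station in jsonResponse:
            # fetch the station url first, keeping the item in meta
            yield Request(station['url'], callback=self.check_station,
                          meta={'station': station})

    def check_station(self, response):
        # only yield the item when the extra response looks good
        content_type = response.headers.get('Content-Type', b'')
        if response.status == 200 and b'text/html' in content_type:
            yield response.meta['station']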

If you still want to include a Scrapy Request inside a pipeline, you could do something like this:

from scrapy import Request
from scrapy.exceptions import DropItem


class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        ...
        # schedule an extra request directly through the engine
        self.crawler.engine.crawl(
            Request(
                url='someurl',
                callback=self.custom_callback,
            ),
            spider,
        )
        # you have to drop the item, and send it again after your check
        raise DropItem()

    # YES, you can define a method callback inside the same pipeline
    def custom_callback(self, response):
        ...
        yield item

Note that we are emulating the behaviour of spider callbacks inside the pipeline. You need to figure out a way to always drop the items for which you want to do the extra request, and only pass through the ones that are yielded by the extra callback.

One way could be to send different types of items and check them inside the pipeline's process_item (a fuller sketch follows the snippet):

def process_item(self, item, spider):
    if isinstance(item, TempItem):
        # schedule the extra request here and drop the temporary item
        ...
    elif isinstance(item, FinalItem):
        return item
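
A minimal sketch of that idea, assuming two hypothetical item classes TempItem and FinalItem (the names, fields and the status check are just illustrative), could look like this:

import scrapy
from scrapy import Request
from scrapy.exceptions import DropItem


class TempItem(scrapy.Item):
    # yielded by the spider, still needs the extra check
    name = scrapy.Field()
    country = scrapy.Field()
    url = scrapy.Field()


class FinalItem(TempItem):
    # yielded by the pipeline's callback once the check has passed
    pass


class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        # FinalItem is checked first because it subclasses TempItem
        if isinstance(item, FinalItem):
            return item
        if isinstance(item, TempItem):
            self.crawler.engine.crawl(
                Request(item['url'],
                        callback=self.custom_callback,
                        meta={'item': item}),
                spider,
            )
            raise DropItem('waiting for the extra check')
        return item

    def custom_callback(self, response):
        # re-emit the item as a FinalItem so it passes through next time
        if response.status == 200:
            yield FinalItem(response.meta['item'])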
answered Nov 07 '22 by eLRuLL