Is it possible to create an integration test for a Scrapy pipeline? I can't figure out how to do this. In particular, I am trying to write a test for the FilesPipeline, and I also want it to persist my mocked response to Amazon S3.
Here is my test:
def _mocked_download_func(request, info):
    return Response(url=request.url, status=200, body="test", request=request)


class FilesPipelineTests(unittest.TestCase):

    def setUp(self):
        self.settings = get_project_settings()
        crawler = Crawler(self.settings)
        crawler.configure()
        self.pipeline = FilesPipeline.from_crawler(crawler)
        self.pipeline.open_spider(None)
        self.pipeline.download_func = _mocked_download_func

    @defer.inlineCallbacks
    def test_file_should_be_directly_available_from_s3_when_processed(self):
        item = CrawlResult()
        item['id'] = "test"
        item['file_urls'] = ['http://localhost/test']
        result = yield self.pipeline.process_item(item, None)
        self.assertEquals(result['files'][0]['path'], "full/002338a87aab86c6b37ffa22100504ad1262f21b")
I always run into the following error:
DirtyReactorAggregateError: Reactor was unclean.
How do I create a proper test using Twisted and Scrapy?
Up to now I have done my pipeline tests without calling from_crawler, so they are not ideal, because they do not test the functionality of from_crawler, but they work. I do them by using an empty Spider instance:
from unittest import mock

import pytest

from scrapy.spiders import Spider

# some other imports for my own stuff (MqttOutputPipeline, BasicItem)


@pytest.fixture
def mqtt_client():
    client = mock.Mock()
    return client


def test_mqtt_pipeline_does_return_item_after_process(mqtt_client):
    spider = Spider(name='spider')
    pipeline = MqttOutputPipeline(mqtt_client, 'dummy-namespace')

    item = BasicItem()
    item['url'] = 'http://example.com/'
    item['source'] = 'dummy source'

    ret = pipeline.process_item(item, spider)

    assert ret is not None
(In fact, I forgot to call open_spider().)
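If your pipeline implements open_spider()/close_spider(), a variant of the test can exercise that lifecycle as well. This is just a sketch, assuming those hooks exist on MqttOutputPipeline (which, like BasicItem, stands in for whatever pipeline and item you are testing):

def test_mqtt_pipeline_with_spider_lifecycle(mqtt_client):
    spider = Spider(name='spider')
    pipeline = MqttOutputPipeline(mqtt_client, 'dummy-namespace')

    # open the spider on the pipeline (the step the test above skips)
    pipeline.open_spider(spider)

    item = BasicItem()
    item['url'] = 'http://example.com/'
    item['source'] = 'dummy source'
    ret = pipeline.process_item(item, spider)

    # close it again so the pipeline can flush/clean up
    pipeline.close_spider(spider)

    assert ret is not None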
You can also have a look at how Scrapy itself tests its pipelines, e.g. the MediaPipeline:
class BaseMediaPipelineTestCase(unittest.TestCase):

    pipeline_class = MediaPipeline
    settings = None

    def setUp(self):
        self.spider = Spider('media.com')
        self.pipe = self.pipeline_class(download_func=_mocked_download_func,
                                        settings=Settings(self.settings))
        self.pipe.open_spider(self.spider)
        self.info = self.pipe.spiderinfo

    def test_default_media_to_download(self):
        request = Request('http://url')
        assert self.pipe.media_to_download(request, self.info) is None
You can also look through their other unit tests. For me, these are always good inspiration for how to unit test Scrapy components.
If you want to test the from_crawler function, too, you could have a look at their middleware tests. There, from_crawler is often used to create the middleware, e.g. for OffsiteMiddleware:
from unittest import TestCase

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.spiders import Spider
from scrapy.utils.test import get_crawler


class TestOffsiteMiddleware(TestCase):

    def setUp(self):
        crawler = get_crawler(Spider)
        self.spider = crawler._create_spider(**self._get_spiderargs())
        self.mw = OffsiteMiddleware.from_crawler(crawler)
        self.mw.spider_opened(self.spider)
I assume the key component here is calling get_crawler from scrapy.utils.test. It seems they factored out some of the calls you need to make in order to have a testing environment.
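Transferred to your FilesPipeline case, a setup along these lines might work. This is only a sketch, not tested against S3; it assumes a recent Scrapy where FilesPipeline lives in scrapy.pipelines.files, the bucket name in FILES_STORE is a placeholder, and make_files_pipeline is just a helper name I made up:

from scrapy.http import Response
from scrapy.pipelines.files import FilesPipeline
from scrapy.spiders import Spider
from scrapy.utils.test import get_crawler


def _mocked_download_func(request, info):
    return Response(url=request.url, status=200, body=b"test", request=request)


def make_files_pipeline():
    # get_crawler builds a crawler with a populated settings object,
    # so from_crawler can be exercised just like in production
    crawler = get_crawler(Spider, settings_dict={
        'FILES_STORE': 's3://my-test-bucket/files/',  # placeholder, point at your own bucket
    })
    pipeline = FilesPipeline.from_crawler(crawler)
    pipeline.download_func = _mocked_download_func
    pipeline.open_spider(Spider(name='test'))
    return pipeline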