Scrapy: crawl multiple spiders sharing same items, pipeline, and settings but with separate outputs

I am trying to run multiple spiders using a Python script based on the code provided in the official documentation. My Scrapy project contains multiple spiders (Spider1, Spider2, etc.), which crawl different websites and save the content of each website in a separate JSON file (output1.json, output2.json, etc.).

The items collected from the different websites share the same structure, so the spiders use the same item, pipeline, and settings classes. The output is generated by a custom JSON export class in the pipeline.

When I run the spiders separately they work as expected, but when I use the script below to run them through the Scrapy API, the items get mixed up in the pipeline: output1.json should only contain the items crawled by Spider1, but it also contains the items of Spider2. How can I crawl multiple spiders with the Scrapy API, using the same items, pipeline, and settings, but generate separate outputs?

Here is the code I used to run multiple spiders:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(Spider1)
process.crawl(Spider2)
process.start()

Example output1.json:

{
"Name": "Thomas",
"source": "Spider1"
}
{
"Name": "Paul",
"source": "Spider2"
}
{
"Name": "Nina",
"source": "Spider1"
}

Example output2.json:

{
"Name": "Sergio",
"source": "Spider1"
}
{
"Name": "David",
"source": "Spider1"
}
{
"Name": "James",
"source": "Spider2"
}

Normally, all the names crawled by Spider1 ("source": "Spider1") should be in output1.json, and all the names crawled by Spider2 ("source": "Spider2") should be in output2.json.

Thank you for your help!

asked Oct 24 '25 by jbp

2 Answers

The first problem was that the spiders were running concurrently in the same process. Running the spiders sequentially by chaining the deferreds solved this problem:

# Scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging

# Spiders
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    # runner.crawl() returns a Deferred; yielding it waits for one spider
    # to finish before the next one starts.
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()

I also had a second mistake in my pipeline: I didn't clear my list of results in close_spider. Because results is defined as a class attribute, it is shared between pipeline instances, so Spider2 was adding items to a list that already contained the items of Spider1.

import json

class ExportJSON(object):

    # Class-level list, shared between pipeline instances.
    results = []

    def process_item(self, item, spider):
        self.results.append(dict(item))
        return item

    def close_spider(self, spider):
        # file_name is defined elsewhere in the project (one file per spider).
        file = open(file_name, 'w')
        line = json.dumps(self.results)
        file.write(line)
        file.close()

        # Clear the shared list so the next spider starts with an empty one.
        self.results.clear()

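For reference, the pipeline can also key the output file on the spider itself, which removes the need to clear anything by hand. This is only a minimal sketch: the ExportJSONPerSpider name and the output_<spider.name>.json naming scheme are assumptions, not part of the original project.

import json

class ExportJSONPerSpider(object):
    """Sketch: collect items per spider and write one JSON file per spider."""

    def open_spider(self, spider):
        # Instance attribute created for each crawl, so nothing is shared
        # between spiders running in the same process.
        self.results = []

    def process_item(self, item, spider):
        self.results.append(dict(item))
        return item

    def close_spider(self, spider):
        # Hypothetical naming scheme: output_spider1.json, output_spider2.json, ...
        file_name = 'output_%s.json' % spider.name
        with open(file_name, 'w') as f:
            json.dump(self.results, f)

Because results is created in open_spider as an instance attribute, each crawl starts with an empty list even when several spiders run in the same process.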
Thank you!

answered Oct 26 '25 by jbp

According to the docs, to run spiders sequentially in the same process you must chain the deferreds. Note that, unlike CrawlerProcess, CrawlerRunner does not start the Twisted reactor for you, so you start and stop it yourself.

Try this:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()

answered Oct 26 '25 by Henrique Coura

