Running Multiple spiders in scrapy for 1 website in parallel?

Tags:

python

web-scraping

scrapy

web-crawler

scrapy-spider

I want to crawl a website that has 2 parts, and my script is not as fast as I need it to be.

Is it possible to launch 2 spiders, one for scraping the first part and the second one for the second part?

I tried writing 2 different spider classes and running them with:

scrapy crawl firstSpider
scrapy crawl secondSpider

but I don't think that is a smart approach.

I read the scrapyd documentation, but I don't know whether it fits my case.


asked Sep 07 '16 08:09

parik

3 Answers

I think what you are looking for is something like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

You can read more at: running-multiple-spiders-in-the-same-process.


answered Sep 20 '22 05:09

K Hörnell

Or you can run all spiders like this. Save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument passed to each spider

process.start()

answered Sep 18 '22 05:09

Yuda Prawira

A better solution, if you have multiple spiders, is to discover them dynamically and run them.

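from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks


@inlineCallbacks
def crawl():
    settings = project.get_project_settings()
    runner = CrawlerRunner(settings)
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    # Each yield waits for that crawl to finish, so the spiders run one after another.
    for my_spider in classes:
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called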

(Second solution): Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:

from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)
process.start()