I want to crawl a website with 2 parts and my script is not as fast as I need.
Is it possible to launch 2 spiders, one for scraping the first part and the second one for the second part?
I tried to have 2 different classes, and run them
scrapy crawl firstSpider
scrapy crawl secondSpider
but i think that it is not smart.
I read the documentation of scrapyd but I don't know if it's good for my case.
We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl . Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
The key to running scrapy in a python script is the CrawlerProcess class. This is a class of the Crawler module. It provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported.
I think what you are looking for is something like this:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
You can read more at: running-multiple-spiders-in-the-same-process.
Or you can run with like this, you need to save this code at the same directory with scrapy.cfg (My scrapy version is 1.3.3) :
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess
setting = get_project_settings()
process = CrawlerProcess(setting)
for spider_name in process.spiders.list():
print ("Running spider %s" % (spider_name))
process.crawl(spider_name,query="dvh") #query dvh is custom argument used in your scrapy
process.start()
Better solution is (if you have multiple spiders) it dynamically get spiders and run them.
from scrapy import spiderloader
from scrapy.utils import project
from twisted.internet.defer import inlineCallbacks
@inlineCallbacks
def crawl():
settings = project.get_project_settings()
spider_loader = spiderloader.SpiderLoader.from_settings(settings)
spiders = spider_loader.list()
classes = [spider_loader.load(name) for name in spiders]
for my_spider in classes:
yield runner.crawl(my_spider)
reactor.stop()
crawl()
reactor.run()
(Second Solution):
Because spiders.list()
is deprecated in Scrapy 1.4 Yuda solution should be converted to something like
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess
settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)
for spider_name in spider_loader.list():
print("Running spider %s" % (spider_name))
process.crawl(spider_name)
process.start()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With