I used Scrapy to create a project and added my own spider, say "spider_us.py", in the "spiders" folder. Now I want to build an exe file that can be executed on other computers without installing Scrapy.
Following the py2exe instructions, I made a new file "Setup.py" in the same folder with the following content:
from distutils.core import setup
import py2exe
setup(console = ["spider_us.py"])
However, this doesn't work, because I run my spider with the command "scrapy crawl spider_us" rather than by running the file "spider_us.py" in the "spiders" folder directly.
How can I build the entire spider project (created automatically by "scrapy startproject XXX") into an exe file, rather than only the spider file ("spider_us.py" in my case) in the "spiders" subfolder?
Any advice, help, or comment is welcome. Thanks so much.
Using the scrapy tool

You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]
The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script; internally, the CrawlerProcess code imports Python's Twisted framework.
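As a minimal sketch of that approach (assuming a recent Scrapy version and that the script is run from the project root, so that get_project_settings() can find your settings.py), a runner script could look like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess starts and stops the Twisted reactor for you
process = CrawlerProcess(get_project_settings())
process.crawl('spider_us')  # the spider name registered in the project
process.start()             # blocks until the crawl is finished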
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
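For reference, a minimal spider sketch (the start URL and parsing logic below are placeholders, not taken from the question):

import scrapy

class SpiderUS(scrapy.Spider):
    name = 'spider_us'  # the name used by "scrapy crawl spider_us"
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        # yield one item per link found on the page
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}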
Try running the spiders through a Python script (instead of via the command scrapy crawl <spider_name>). You'll need to write some code, e.g.:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider  # replace with your own spider class
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()  # reads settings.py of the active project
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)  # stop the reactor when the spider finishes
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here until the spider_closed signal was sent
For details, see the documentation on "Run Scrapy from a script".
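Once you have such a runner script (say "run_spider.py"; the file name here is just an example), you can point py2exe at it instead of at "spider_us.py", reusing the setup from the question:

from distutils.core import setup
import py2exe

# build the exe from the runner script, not the spider module itself
setup(console=['run_spider.py'])

Note that you may still need to tell py2exe about packages Scrapy imports dynamically; see py2exe's documentation on its bundling options.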