How to build my Scrapy spider into an exe file using py2exe?

I used Scrapy to create a project and added my own spider, say "spider_us.py", in the "spiders" folder. I want to build an exe file that can be executed on other computers without installing Scrapy.

Following the py2exe instructions, I created a new file "Setup.py" in the same folder with the following content:

from distutils.core import setup
import py2exe

setup(console = ["spider_us.py"])

However, it didn't work, because I run my spider with the command "scrapy crawl spider_us" rather than by directly running the file "spider_us.py" in the "spiders" folder.

How can I build the entire spider project (created automatically by Scrapy when I run "scrapy startproject XXX") into an exe file, rather than only the spider file ("spider_us.py" in my case) in the "spiders" subfolder?

Any advice or help is welcome. Thanks so much.

Myzh asked Oct 18 '13


1 Answer

Try running the spider through a Python script (instead of the command scrapy crawl <spider_name>). You'll need to write some code, e.g.:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here until the spider_closed signal is sent

For details, see the documentation on "Run Scrapy from a script".

starrify answered Oct 22 '22