
How to deploy a Scrapy spider on Heroku cloud

I have developed a few spiders in Scrapy and I want to test them on the Heroku cloud. Does anybody have any idea how to deploy a Scrapy spider on Heroku?

asked Oct 08 '12 by Aniruddha

1 Answer

Yes, it's fairly simple to deploy and run your Scrapy spider on Heroku.

Here are the steps, using a real Scrapy project as an example:

  1. Clone the project (note that it must have a requirements.txt file for Heroku to recognize it as a Python project):

    git clone https://github.com/scrapinghub/testspiders.git

  2. Add cffi to the requirements.txt file (e.g. cffi==1.1.0).

  3. Create the Heroku application (this will add a new heroku git remote):

    heroku create

  4. Deploy the project (this will take a while the first time, when the slug is built):

    git push heroku main

  5. Run your spider:

    heroku run scrapy crawl followall
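
For reference, after step 2 the requirements.txt might look something like this. Only the cffi pin comes from the answer; the Scrapy line (and any version you pin it to) is an illustrative assumption:

    # requirements.txt -- presence of this file is what makes Heroku
    # treat the repo as a Python app and install the dependencies
    Scrapy
    cffi==1.1.0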

Some notes:

  • Heroku's disk is ephemeral. If you want to store the scraped data in a persistent place, you can use an S3 feed export (by appending -o s3://mybucket/items.jl to the crawl command) or use an addon (like MongoHQ or Redis To Go) and write a pipeline to store your items there.
  • It would be cool to run a Scrapyd server on Heroku, but it's not currently possible because the sqlite3 module (which Scrapyd requires) doesn't work on Heroku.
  • If you want a more sophisticated solution for deploying your Scrapy spiders, consider setting up your own Scrapyd server or using a hosted service like Scrapy Cloud.
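
The pipeline approach from the first note can be sketched roughly as below. This is a minimal sketch, not the answer's own code: the class name is made up, the buffer is a stand-in for a real storage client, and the REDISTOGO_URL variable name is an assumption about how a Heroku addon would expose its connection string.

```python
import json
import os


class ExternalStorePipeline:
    """Sketch of a Scrapy item pipeline that ships items to an
    external store instead of Heroku's ephemeral disk."""

    def __init__(self, store_url):
        self.store_url = store_url
        # Stand-in for a real client (e.g. a Redis or MongoDB
        # connection); items are just collected in memory here.
        self.buffer = []

    @classmethod
    def from_crawler(cls, crawler):
        # Heroku addons typically expose their connection string as
        # an environment variable (REDISTOGO_URL is an assumed name).
        return cls(store_url=os.environ.get("REDISTOGO_URL", ""))

    def process_item(self, item, spider):
        # Serialize the item; a real pipeline would push it to the
        # external store here (e.g. RPUSH or an insert) instead.
        self.buffer.append(json.dumps(dict(item)))
        return item
```

The pipeline would then be enabled in the project's settings.py via the ITEM_PIPELINES setting, like any other Scrapy pipeline.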
answered Nov 02 '22 by Pablo Hoffman