Heroku and Web scraping

I have a Nokogiri web scraper that writes to a database, and I'm trying to deploy it to Heroku. I also have a Sinatra application as the frontend, which I want to read from that database. I'm new to Heroku and web development, and I don't know the best way to handle something like this.

Do I have to put the scraper script that writes to the database under a Sinatra route (like mywebsite.com/scraper) and just make it so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that reads from the database.

Thanks for all input

John Lamburger asked Jul 12 '13 00:07



1 Answer

There are two approaches you can take.

The first is to use one-off dynos: run the scraper from the console with heroku run YOURCMD. Just make sure the scraper doesn't write to disk but to the database instead, since a dyno's filesystem is ephemeral and anything written there is lost when the dyno exits.

More information: https://devcenter.heroku.com/articles/one-off-dynos
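
As a rough sketch of the "database, not disk" part, the scraper script could read its connection details from the environment and hand every scraped row to a database writer. This assumes the database is exposed through a DATABASE_URL config var (as Heroku Postgres does); the names database_settings and store_rows, and the writer callable standing in for the real insert, are made up for illustration, and the Nokogiri parsing itself is left out.

```ruby
require "uri"

# Split a DATABASE_URL-style string into the pieces a database client needs.
# Assumption: the URL looks like scheme://user:password@host:port/dbname.
def database_settings(url)
  uri = URI.parse(url)
  {
    adapter:  uri.scheme,
    host:     uri.host,
    port:     uri.port,
    username: uri.user,
    password: uri.password,
    database: uri.path.delete_prefix("/")
  }
end

# Persist scraped rows through a caller-supplied writer instead of the local disk.
# `rows` is an array of hashes; `writer` is any object that responds to #call
# (hypothetical stand-in for an INSERT through your database library).
# Returns the number of rows handed to the writer.
def store_rows(rows, writer)
  rows.each { |row| writer.call(row) }
  rows.size
end
```

In a script named, say, scraper.rb, you would call database_settings(ENV.fetch("DATABASE_URL")), connect with your database library of choice, and then start it as a one-off dyno with heroku run ruby scraper.rb.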

The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a dedicated scraper process does the scraping, which the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (authentication, URL obfuscation, etc.).

More information: https://devcenter.heroku.com/articles/background-jobs-queueing
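
If the web process does expose a way to kick off the scraper, a shared secret is one simple alternative to relying on an obscure URL. The following is only a sketch of the check itself, not a complete auth scheme; the helper names are hypothetical, and the expected token is assumed to come from a config var you set yourself.

```ruby
# Compare two strings without stopping at the first differing byte,
# so response timing does not reveal how much of the token matched.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# True only when a token was supplied and it matches the configured secret.
# `provided` is what the caller sent; `expected` is the secret on the server.
def scraper_request_allowed?(provided, expected)
  return false if provided.nil? || expected.nil? || expected.empty?
  secure_equal?(provided, expected)
end
```

Inside a Sinatra route, this could be used as halt 403 unless scraper_request_allowed?(request.env["HTTP_X_SCRAPER_TOKEN"], ENV["SCRAPER_TOKEN"]), where both the header name and the SCRAPER_TOKEN variable are placeholders.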

XLII answered Sep 20 '22 11:09