I have a Nokogiri web scraper that writes to a database, and I'm trying to deploy it to Heroku. I also have a Sinatra application as a front end that I want to read from that database. I'm new to Heroku and web development, and I don't know the best way to handle something like this.
Do I have to put the scraper script that writes to the database under a Sinatra route (like mywebsite.com/scraper) and just make it so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that reads from the database.
Thanks for all input
There are two approaches you can take.
The first is to use one-off dynos: run the scraper from the console with heroku run YOURCMD. Just make sure the scraper writes to the database and not to disk, because a dyno's filesystem is ephemeral and anything written there is lost when the dyno exits.
More information: https://devcenter.heroku.com/articles/one-off-dynos
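As a rough sketch of the first approach, the scraper can be a plain Ruby script that is started with something like heroku run ruby scraper.rb and stores its results through the database connection instead of a file. Everything named here is an assumption for illustration, not part of the original setup: the file name scraper.rb, the SCRAPE_URL variable, the links table and its columns, and the use of the Sequel gem to talk to the database that DATABASE_URL points at.

```ruby
# scraper.rb (hypothetical file name), run with: heroku run ruby scraper.rb
require 'open-uri'

# Assumed target page; SCRAPE_URL is a made-up config variable.
SOURCE_URL = ENV.fetch('SCRAPE_URL', 'https://example.com/')

# Turn an HTML string into row hashes, one per link on the page.
# Nokogiri is required lazily, only when parsing actually happens.
def extract_rows(html)
  require 'nokogiri'
  Nokogiri::HTML(html).css('a[href]').map do |link|
    { title: link.text.strip, url: link['href'] }
  end
end

# Drop rows with a missing field and keep the first row per URL.
def clean_rows(rows)
  rows.reject { |r| r[:title].to_s.empty? || r[:url].to_s.empty? }
      .uniq { |r| r[:url] }
end

# Write rows to the database rather than to the dyno's ephemeral disk.
# `db` is expected to behave like a Sequel database handle.
def save_rows(db, rows)
  rows.each { |row| db[:links].insert(row) }
  rows.size
end

def main
  require 'sequel' # assumption: Sequel plus a driver gem are in the Gemfile
  db = Sequel.connect(ENV.fetch('DATABASE_URL'))
  db.create_table?(:links) do # hypothetical table, created if missing
    primary_key :id
    String :title
    String :url
  end
  html = URI.open(SOURCE_URL).read
  count = save_rows(db, clean_rows(extract_rows(html)))
  puts "Stored #{count} rows"
end

# Only run when a database is configured, as it is on a Heroku dyno
# with a database add-on attached.
main if ENV.key?('DATABASE_URL')
```

The Sinatra app can then read the same table through its own connection to DATABASE_URL, so the scraper never needs a public route.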
The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a separate scraper process does the scraping, which the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (authentication, URL obfuscation, etc.).
More information: https://devcenter.heroku.com/articles/background-jobs-queueing
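One possible shape for the second approach, sketched under several assumptions: the web process asks for a scrape by inserting a row into a shared scrape_requests table, and a separate worker process polls that table and does the work. The table name, its id, url and status columns, the worker.rb file name, and the use of the Sequel gem are all hypothetical, and the claim step is not atomic, so this only suits a single worker dyno. A real queueing library would be the more robust choice.

```ruby
# worker.rb (hypothetical file name), started by a Procfile line such as:
#   worker: bundle exec ruby worker.rb
# next to the usual web entry:
#   web: bundle exec ruby app.rb

# Take the oldest pending request, run the scraper for it, and record
# the outcome. `requests` is expected to behave like a Sequel dataset.
# Returns false when there was nothing to do, true otherwise.
def process_next(requests, scraper)
  job = requests.where(status: 'pending').order(:id).first
  return false unless job

  requests.where(id: job[:id]).update(status: 'running')
  begin
    scraper.call(job[:url])
    requests.where(id: job[:id]).update(status: 'done')
  rescue StandardError
    # Keep the worker alive and leave a trace of the failure.
    requests.where(id: job[:id]).update(status: 'failed')
  end
  true
end

def main
  require 'sequel' # assumption: Sequel plus a driver gem are in the Gemfile
  db = Sequel.connect(ENV.fetch('DATABASE_URL'))
  # Placeholder: the real scraping code would go here.
  scraper = ->(url) { puts "scraping #{url}" }
  loop do
    # Sleep only when the queue was empty, to avoid hammering the database.
    sleep 5 unless process_next(db[:scrape_requests], scraper)
  end
end

# Only run when a database is configured, as it is on a Heroku dyno
# with a database add-on attached.
main if ENV.key?('DATABASE_URL')
```

With this layout the scraper is never reachable over HTTP at all, so the only thing left to protect is whatever route in the web process inserts the request row.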