I want to restrict the crawler access to my rails app running on Heroku. This would have been a straight forward task if I was using Apache OR nginX. Since the app is deployed on Heroku I am not sure how I can restrict access at the HTTP server level.
I have tried to use robots.txt file, but the offending crawlers don't honor robot.txt.
These are the solutions I am considering:
1) A before_filter
in the rails layer to restrict access.
2) Rack based solution to restrict access
I am wondering if there are any better ways to deal with this problem.
I have read about honeypot solutions: You have one URI that must not be crawled (put it in robots.txt). If any IP calls this URI, block it. I'd implement it as a Rack middleware so the hit does not go to the full Rails stack.
Sorry, I googled around but could not find the original article.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With