I have a simple web crawler that requests all the pages in a website's sitemap so I can cache and index them. After several requests, the website begins serving blank pages.
There is nothing in their robots.txt except a link to their sitemap, so I assume I'm not breaking their "rules". I have a descriptive header that links to a page explaining exactly what my intentions are, and the only pages I crawl come from their sitemap.
The HTTP status codes are all still OK, so I can only imagine they're blocking large numbers of HTTP requests made in a short period of time. What is considered a reasonable amount of delay between requests?
Are there any other considerations I've overlooked that could potentially cause this problem?
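For illustration, here is a minimal Python sketch of the kind of crawl loop in question; the sitemap URL, User-Agent string, and delay value are placeholders, not an actual configuration:

```python
import time
import urllib.request
from xml.etree import ElementTree

SITEMAP_URL = "https://example.com/sitemap.xml"                    # placeholder target
USER_AGENT = "MyCrawler/1.0 (+https://example.com/crawler-info)"   # descriptive header
DELAY_SECONDS = 5                                                  # pause between requests

def fetch(url):
    """Fetch one URL, sending the descriptive User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status, resp.read()

def crawl_sitemap():
    # Pull the sitemap, then visit each <loc> entry with a fixed delay.
    status, body = fetch(SITEMAP_URL)
    root = ElementTree.fromstring(body)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for loc in root.findall(".//sm:loc", ns):
        status, body = fetch(loc.text.strip())
        print(status, loc.text.strip(), len(body))
        time.sleep(DELAY_SECONDS)   # politeness delay between requests

if __name__ == "__main__":
    crawl_sitemap()
```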
Every site has different crawler and abuse characteristics it looks for.
The key for any crawler is to emulate human activity and to obey robots.txt.
An exhaustive crawl will trip some websites, and they'll shut you down regardless of how slow you go, whereas some hosts don't mind crawlers zipping along and sucking everything up in one go.
If all else fails, don't request more quickly than one page per minute. If a website blocks you at this rate, then contact them directly - they obviously don't want you to use their content in that way.
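As a sketch of that fallback rate, the helper below blocks so that consecutive requests are at least a minute apart; the class name and the commented-out fetch() call are illustrative, not part of any particular library:

```python
import time

class MinIntervalLimiter:
    """Blocks so that successive calls are at least `interval_seconds` apart."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_call = float("-inf")   # no wait before the first request

    def wait(self):
        now = time.monotonic()
        remaining = self.interval - (now - self.last_call)
        if remaining > 0:
            time.sleep(remaining)
        self.last_call = time.monotonic()

limiter = MinIntervalLimiter(60)      # the "one page per minute" fallback rate
# for url in urls_from_sitemap:       # urls_from_sitemap: your list of pages
#     limiter.wait()                  # enforce the minimum spacing
#     fetch(url)                      # your existing fetch function
```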
I guess Wikipedia has a decent reference on the topic. Obey those guidelines and, for courtesy, allow a bit more.
For example, I'd probably cap the connection rate at one hit per second, or I'd risk mounting an inadvertent DoS attack.
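As a sketch, Python's standard urllib.robotparser can read a site's Crawl-delay directive, which you can then combine with a one-second courtesy floor; the site URL and User-Agent name below are hypothetical:

```python
import urllib.robotparser

USER_AGENT = "MyCrawler/1.0"   # placeholder agent name
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

declared = robots.crawl_delay(USER_AGENT)   # None if no Crawl-delay line exists
delay = max(declared or 0, 1.0)             # never faster than one hit per second
print(f"Waiting {delay} seconds between requests")

# It is also worth checking permission before each fetch:
# if robots.can_fetch(USER_AGENT, page_url):
#     ...
```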