This was the closest question to mine, and it wasn't really answered very well IMO:
Web scraping etiquette
I'm looking for the answer to #1:
How many requests/second should you be doing to scrape?
Right now I pull from a queue of links. Every site that gets scraped has its own thread and sleeps for 1 second between requests. I ask for gzip compression to save bandwidth.
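For reference, the setup described above could be sketched roughly like this with only the standard library. The function names (fetch, group_by_host, crawl_host, crawl) are made up for illustration, and there is no error handling or robots.txt check, so treat it as a starting point rather than a finished crawler.

```python
import gzip
import threading
import time
import urllib.request
from collections import defaultdict
from urllib.parse import urlsplit


def fetch(url, timeout=10):
    """Fetch a URL, asking for gzip and decompressing if the server used it."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return body


def group_by_host(urls):
    """Bucket URLs by hostname so each site gets its own worker."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlsplit(url).netloc].append(url)
    return dict(buckets)


def crawl_host(urls, delay=1.0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Fetch one site's URLs in order, pausing `delay` seconds between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:  # no pause before the very first request
            sleep_fn(delay)
        results[url] = fetch_fn(url)
    return results


def crawl(urls, delay=1.0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Run one thread per site; sites proceed in parallel, each at its own pace."""
    results = {}
    lock = threading.Lock()

    def worker(host_urls):
        fetched = crawl_host(host_urls, delay, fetch_fn, sleep_fn)
        with lock:  # guard the shared dict across threads
            results.update(fetched)

    threads = [
        threading.Thread(target=worker, args=(host_urls,))
        for host_urls in group_by_host(urls).values()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The fetch_fn and sleep_fn parameters are only there so the pacing logic can be exercised without touching the network.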
Are there standards for this? Surely all the big search engines have some set of guidelines they follow for crawl rates.
Using the requests library, you can fetch the content at a given URL, and the Beautiful Soup library helps you parse it and extract the details you want. Beautiful Soup can select data by HTML tag, class, id, CSS selector, and in many other ways.
So is it legal or illegal? Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. Startups love it because it's a cheap and powerful way to gather data without the need for partnerships.
The Wikipedia article on web crawling has some info about what others are doing:
Cho[22] uses 10 seconds as an interval for accesses, and the WIRE crawler [28] uses 15 seconds as the default. The MercatorWeb crawler follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page.[29] Dill et al. [30] use 1 second.
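The adaptive policy in that quote (wait 10t seconds after a download that took t seconds) is easy to sketch. The helper names below are hypothetical, and this only illustrates the rule as quoted, not how any particular crawler actually implements it.

```python
import time


def politeness_delay(download_seconds, factor=10.0):
    """Wait `factor` times as long as the last download took (10t in the quote)."""
    return factor * download_seconds


def timed_fetch(url, fetch_fn, factor=10.0, clock=time.monotonic):
    """Fetch `url` with `fetch_fn`, time it, and return (body, seconds_to_wait)."""
    start = clock()
    body = fetch_fn(url)
    elapsed = clock() - start
    return body, politeness_delay(elapsed, factor)
```

In a crawl loop you would sleep for the returned delay before requesting the next page from the same server, so a slow server automatically gets hit less often.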
I generally try 5 seconds with a bit of randomness so it looks less suspicious.
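Something like the following would do it. The 1-second spread is just a number I picked for the example, and jittered_delay is a made-up name.

```python
import random


def jittered_delay(base=5.0, spread=1.0, rng=random):
    """Return a delay in [base - spread, base + spread], never below zero."""
    return max(0.0, base + rng.uniform(-spread, spread))
```

Call time.sleep(jittered_delay()) between requests instead of sleeping for a fixed interval.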
There is no set standard for this; it depends on how much load the scraping causes. As long as you aren't noticeably affecting the speed of the site for other users, it should be an acceptable scraping speed.
Since the number of users and the load on a website fluctuate constantly, it'd be a good idea to adjust your scraping speed dynamically.
Monitor the latency of downloading each page, and if the latency starts to increase, decrease your scraping speed. Essentially, your scraping speed should be inversely related to the website's load/latency.
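Here is one way that could look: keep a smoothed baseline of recent latencies, back off when a response is noticeably slower than the baseline, and ease back toward the minimum delay when things look normal again. The class name and all the thresholds (1.2x tolerance, doubling on backoff, 0.9 recovery) are arbitrary assumptions for the sketch, so tune them for the site you are scraping.

```python
class AdaptiveThrottle:
    """Adjust the pause between requests based on observed download latency."""

    def __init__(self, min_delay=1.0, max_delay=60.0, smoothing=0.2,
                 tolerance=1.2, backoff=2.0, recovery=0.9):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.smoothing = smoothing    # weight of the newest sample in the baseline
        self.tolerance = tolerance    # how far above baseline counts as "slower"
        self.backoff = backoff        # multiply the delay by this when slower
        self.recovery = recovery      # multiply the delay by this when normal
        self.delay = min_delay
        self.baseline = None          # smoothed latency, set on the first sample

    def record(self, latency):
        """Feed in one page's download time; return the delay to use next."""
        if self.baseline is None:
            self.baseline = latency
            return self.delay
        if latency > self.tolerance * self.baseline:
            # The site is responding slower than usual: scrape less often.
            self.delay = min(self.max_delay, self.delay * self.backoff)
        else:
            # Latency looks normal: drift back toward the minimum delay.
            self.delay = max(self.min_delay, self.delay * self.recovery)
        # Update the exponentially weighted moving average of latency.
        self.baseline += self.smoothing * (latency - self.baseline)
        return self.delay
```

Time each download, pass the elapsed seconds to record(), and sleep for whatever it returns before the next request to that site.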