What is a good crawling speed rate?

I'm crawling web pages to create a search engine and have been able to crawl close to 9300 pages in 1 hour using Scrapy. I'd like to know how much more I can improve, and what value is considered a 'good' crawling speed.

asked Mar 26 '18 by Nilesh Guria

1 Answer

Short answer: there is no single recommended crawling speed for building a search engine.

Long answer:

Crawling speed, in general, doesn't determine whether your crawler is good or bad, or even whether it will work as the program that feeds your search engine.

You also can't talk about a single crawling speed when you're crawling a lot of pages across multiple sites. Crawling speed should be determined per site: the crawler should be configurable so you can change how often it hits any given site at any given time. Google offers this kind of per-site control as well.
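To make that concrete, here is a minimal sketch of the per-site throttling settings Scrapy exposes in a project's settings.py; the specific values are placeholders you would tune per target site:

    # settings.py -- per-site politeness / throttling (values are examples)
    CONCURRENT_REQUESTS_PER_DOMAIN = 4   # max parallel requests to one site
    DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same site
    ROBOTSTXT_OBEY = True                # respect each site's robots.txt

    # AutoThrottle adjusts the delay dynamically based on observed server latency
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

With these in place, raw pages-per-hour becomes a side effect of how politely you've decided to treat each site, rather than a number you chase directly.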

As for the rate you mentioned (9300 pages/hour), that works out to roughly 2.5 pages per second, which I would say is not bad, but as explained above, it doesn't tell you much about your end goal (creating a search engine).
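For reference, the quick arithmetic behind that figure (using the 9300/hour number from the question):

    # Convert the reported rate into pages/second and pages/day
    pages_per_hour = 9300
    pages_per_second = pages_per_hour / 3600   # ~2.58 pages/s
    pages_per_day = pages_per_hour * 24        # ~223,200 pages/day
    print(f"{pages_per_second:.2f} pages/s, {pages_per_day:,} pages/day")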

Also, if you really decide to implement a broad crawl for a search engine with Scrapy, you'll never run just one Scrapy process. You'll need to set up thousands (or more) of spider runs to gather the information you need, plus additional services to maintain those spiders and coordinate how they behave across processes. For starters I would recommend checking Frontera and Scrapyd (see the sketch below).
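As a rough illustration of where Scrapyd fits in, here is a minimal sketch of scheduling a spider run through Scrapyd's HTTP API. It assumes Scrapyd is running locally on its default port 6800, and the project and spider names are placeholders for whatever you deploy:

    import requests

    # Scrapyd exposes a simple HTTP API; schedule.json queues a spider run.
    # "search_engine" and "broad_spider" are placeholder names here.
    resp = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "search_engine", "spider": "broad_spider"},
    )
    print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

A scheduler like this (or a frontier manager such as Frontera) is what lets you fan the crawl out over many spiders while still respecting per-site rate limits.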

answered Sep 26 '22 by eLRuLL