 

How to crawl billions of pages? [closed]

Tags:

web-crawler

Is it possible to crawl billions of pages on a single server?

asked Dec 20 '09 by gpow

People also ask

Why are pages blocked from crawling?

The whole website or certain pages can remain unseen by Google for a simple reason: its crawlers are not allowed to enter them. There are several bot directives that will prevent page crawling. Note that it's not necessarily a mistake to have these directives in robots.txt.

How long does it take to crawl the entire web?

Assuming a website is crawlable and indexable, Google usually takes anywhere from 3-4 days to a couple of weeks. Submitting a sitemap file in Google Search Console usually helps speed the process up, but that doesn't guarantee the page will be crawled.


2 Answers

Not if you want the data to be up to date.

Even a small player in the search game would number the pages crawled in the multiple billions.

" In 2006, Google has indexed over 25 billion web pages,[32] 400 million queries per day,[32] 1.3 billion images, and over one billion Usenet messages. " - Wikipedia

And remember, that quote is citing numbers from 2006. That's ancient history; the state of the art is well beyond it.

Freshness of content:

  1. New content is constantly added at a very large rate (reality)
  2. Existing pages often change - you'll need to recrawl for two reasons: (a) to determine whether the page is dead, and (b) to determine whether its content has changed (a small revisit-scheduling sketch follows this list).
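
Not from the answer itself, but here is a minimal sketch of one common way to schedule recrawls: shrink the revisit interval when a page was seen to change, grow it when it wasn't. The interval bounds are purely illustrative assumptions.

    # Adaptive revisit scheduling (sketch). MIN/MAX bounds are assumed values.
    MIN_INTERVAL = 6 * 3600         # assumed floor: 6 hours
    MAX_INTERVAL = 30 * 24 * 3600   # assumed ceiling: 30 days

    def next_interval(current, changed):
        """Return the next revisit interval (in seconds) for a page."""
        if changed:
            # Pages that change get checked more often.
            return max(MIN_INTERVAL, current / 2)
        # Stable pages back off toward the ceiling.
        return min(MAX_INTERVAL, current * 2)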

Politeness of crawler:

  1. You can't overwhelm any one given site. If you hit any major site repeatedly from the same IP, you're likely to trigger a CAPTCHA prompt or get your IP address blocked. Sites will do this based on bandwidth, frequency of requests, # of "bad" page requests, and all sorts of other things.
  2. There is a robots.txt protocol that sites expose to crawlers, obey it.
  3. There is a sitemap standard that sites expose to crawlers - use it to help you explore. You can also (if you choose) weight the relative importance of pages on the site, and use the time-to-live in your cache if the sitemap indicates one. (A small politeness sketch follows this list.)
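
As a rough illustration of points 1 and 2, here is a hedged Python sketch that checks robots.txt (via the standard library's urllib.robotparser) and enforces a per-host delay. The user agent string and the 2-second fallback delay are assumptions, not anything from the answer.

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "MyCrawler"   # hypothetical user-agent name
    DEFAULT_DELAY = 2.0        # assumed fallback politeness delay, in seconds

    _robots = {}       # host -> RobotFileParser
    _last_fetch = {}   # host -> time of the last request to that host

    def allowed(url):
        """Check robots.txt (cached per host) before fetching a URL."""
        host = urlparse(url).netloc
        if host not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("https://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                pass   # robots.txt unreachable: the parser will deny conservatively
            _robots[host] = rp
        return _robots[host].can_fetch(USER_AGENT, url)

    def wait_for_slot(url):
        """Sleep until this host's politeness delay has elapsed."""
        host = urlparse(url).netloc
        rp = _robots.get(host)
        delay = (rp.crawl_delay(USER_AGENT) if rp else None) or DEFAULT_DELAY
        remaining = _last_fetch.get(host, 0.0) + delay - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        _last_fetch[host] = time.monotonic()

A real crawler would also feed sitemap hints (priority, change frequency) into its queue ordering, but that part is omitted here.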

Reduce the work you need to do:

  1. Sites often expose themselves through multiple names - you'll want to detect pages that are identical, whether they appear at the same URL or at separate URLs. Consider a hash of the page contents (minus headers with dates/times that constantly change). Keep track of these page equivalencies so you can skip them next time, or determine whether there is a well-known mapping between the given sites so that you don't have to crawl them. (A dedup-by-hash sketch follows this list.)
  2. SPAM. Tons of people out there make tons of pages that are just pass-throughs to Google, and they "seed" themselves all over the web to get themselves crawled.
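
A rough sketch of the duplicate-detection idea from point 1: hash a normalized copy of the page body and skip anything you've already seen. The normalization here is a deliberately crude assumption; real crawlers strip boilerplate, session IDs and timestamps much more carefully.

    import hashlib

    seen_hashes = {}   # content fingerprint -> first URL seen with that content

    def content_fingerprint(html):
        # Crude normalization: lowercase and collapse whitespace before hashing.
        normalized = " ".join(html.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_duplicate(url, html):
        h = content_fingerprint(html)
        if h in seen_hashes:
            return True    # same content already crawled under another URL
        seen_hashes[h] = url
        return False

At billions of pages the in-memory dict would be replaced by an on-disk store, and exact hashing is usually supplemented with near-duplicate techniques such as shingling or simhash.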

So - you're always in a cycle of crawling. Always. You'll almost certainly be running on several (many, many, many) machines to ensure you can comply with politeness but still keep the data fresh.

If you want to press the fast-forward button and just get to processing pages with your own unique algorithm, you could likely tap into a pre-built crawler if you need it quickly - think "80 legs", as highlighted in Programmable Web. They do it using client-side computing power.

80 legs uses machine cycles from kids playing games on web sites. Think of a background process on a web page that calls out and does work while you're using that page/site, without you knowing it, because they are using the Plura technology stack.

“Plura Processing has developed a new and innovative technology for distributed computing. Our patent-pending technology can be embedded in any webpage. Visitors to these webpages become nodes and perform very small computations for the application running on our distributed computing network.” - Plura Demo Page

So they issue the "crawl" through thousands of nodes at thousands of IPs, staying polite to sites and crawling fast as a result. Now, I personally don't know that I care for that style of using the end user's browser unless it were called out VERY clearly on all of the sites using their technology - but it's an out-of-the-box approach if nothing else.

There are other crawlers, written as community-driven projects, that you could likely use as well.

As pointed out by other respondents - do the math. You'll need roughly 2,300 pages crawled per second to keep up with crawling 1B pages every 5 days. If you're willing to wait longer, the number goes down; if you're thinking you're going to crawl more than 1B, the number goes up. Simple math (worked out below).
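
For what it's worth, the arithmetic behind that ~2,300 figure:

    # 1 billion pages, refreshed every 5 days
    pages = 1_000_000_000
    seconds = 5 * 24 * 3600      # 432,000 seconds in 5 days
    print(pages / seconds)       # ~2315 pages per second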

Good luck!

answered Sep 23 '22 by Dave Quick

Large scale spidering (a billion pages) is a difficult problem. Here are some of the issues:

  • Network bandwidth. Assuming that each page is 10 KB, you are talking about a total of 10 terabytes to be fetched.

  • Network latency / slow servers / congestion mean that you are not going to achieve anything like the theoretical bandwidth of your network connection. Multi-threading your crawler only helps so much.

  • I assume that you need to store the information you have extracted from the billions of pages.

  • Your HTML parser needs to deal with web pages that are broken in all sorts of strange ways.

  • To avoid getting stuck in loops, you need to detect that you've "done this page already" (see the seen-URL sketch after this list).

  • Pages change so you need to revisit them.

  • You need to deal with 'robots.txt' and other conventions that govern the behavior of (well-behaved) crawlers.
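
As a small illustration of the "done this page already" point, here is a hedged sketch of URL canonicalization plus a seen-set. The canonicalization rules are illustrative assumptions; real crawlers also strip tracking parameters, default ports, and so on, and at a billion URLs the in-memory set would become a Bloom filter or an on-disk structure.

    from urllib.parse import urlsplit, urlunsplit

    seen_urls = set()

    def canonical(url):
        parts = urlsplit(url)
        # Lowercase scheme/host, drop the fragment, trim a trailing slash.
        path = parts.path.rstrip("/") or "/"
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           path, parts.query, ""))

    def should_visit(url):
        key = canonical(url)
        if key in seen_urls:
            return False   # already crawled (or queued) this page
        seen_urls.add(key)
        return True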

answered Sep 26 '22 by Stephen C