 

How to crawl billions of pages? [closed]

Tags:

web-crawler

Is it possible to crawl billions of pages on a single server?

asked Dec 20 '09 by gpow

People also ask

Why are pages blocked from crawling?

The whole website or certain pages can remain unseen by Google for a simple reason: its crawlers are not allowed to enter them. There are several bot directives that will prevent page crawling. Note that it's not necessarily a mistake to have these directives in robots.txt.

How long does it take to crawl the entire web?

Assuming a website is crawlable and indexable, Google usually takes anywhere from 3-4 days to a couple of weeks. Submitting a sitemap file in Google Search Console usually helps speed the process up, but that doesn't guarantee the page will be crawled.


2 Answers

Not if you want the data to be up to date.

Even a small player in the search game would number the pages crawled in the multiple billions.

" In 2006, Google has indexed over 25 billion web pages,[32] 400 million queries per day,[32] 1.3 billion images, and over one billion Usenet messages. " - Wikipedia

And remember, that quote is citing numbers from 2006. That's ancient history; the state of the art is well beyond it.

Freshness of content:

  1. New content is constantly added at a very large rate (reality)
  2. Existing pages often change - you'll need to recrawl for two reasons: (a) to determine whether the page is dead, and (b) to determine whether its content has changed (a small revisit-scheduling sketch follows this list).
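
Not from the answer itself, but here is a minimal sketch of one common way to schedule recrawls: shrink the revisit interval when a page was seen to change, grow it when it wasn't. The interval bounds are purely illustrative assumptions.

    # Adaptive revisit scheduling (sketch). MIN/MAX bounds are assumed values.
    MIN_INTERVAL = 6 * 3600         # assumed floor: 6 hours
    MAX_INTERVAL = 30 * 24 * 3600   # assumed ceiling: 30 days

    def next_interval(current, changed):
        """Return the next revisit interval (in seconds) for a page."""
        if changed:
            # Pages that change get checked more often.
            return max(MIN_INTERVAL, current / 2)
        # Stable pages back off toward the ceiling.
        return min(MAX_INTERVAL, current * 2)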

Politeness of crawler:

  1. You can't overwhelm any one given site. If you hit any major site repeatedly from the same IP, you're likely to trigger a CAPTCHA prompt or get your IP address blocked. Sites will do this based on bandwidth, frequency of requests, # of "bad" page requests, and all sorts of other things.
  2. There is a robots.txt protocol that sites expose to crawlers, obey it.
  3. There is a sitemap standard that sites expose to crawlers - use it to help you explore. You can also (if you choose) weight the relative importance of pages on the site, and use the time-to-live in your cache if the sitemap indicates one. (A small politeness sketch follows this list.)
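
As a rough illustration of points 1 and 2, here is a hedged Python sketch that checks robots.txt (via the standard library's urllib.robotparser) and enforces a per-host delay. The user agent string and the 2-second fallback delay are assumptions, not anything from the answer.

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "MyCrawler"   # hypothetical user-agent name
    DEFAULT_DELAY = 2.0        # assumed fallback politeness delay, in seconds

    _robots = {}       # host -> RobotFileParser
    _last_fetch = {}   # host -> time of the last request to that host

    def allowed(url):
        """Check robots.txt (cached per host) before fetching a URL."""
        host = urlparse(url).netloc
        if host not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("https://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                pass   # robots.txt unreachable: the parser will deny conservatively
            _robots[host] = rp
        return _robots[host].can_fetch(USER_AGENT, url)

    def wait_for_slot(url):
        """Sleep until this host's politeness delay has elapsed."""
        host = urlparse(url).netloc
        rp = _robots.get(host)
        delay = (rp.crawl_delay(USER_AGENT) if rp else None) or DEFAULT_DELAY
        remaining = _last_fetch.get(host, 0.0) + delay - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        _last_fetch[host] = time.monotonic()

A real crawler would also feed sitemap hints (priority, change frequency) into its queue ordering, but that part is omitted here.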

Reduce the work you need to do:

  1. Sites often expose themselves through multiple names - you'll want to detect pages that are identical, whether they appear at the same URL or at separate URLs. Consider a hash of the page contents (minus headers with dates/times that constantly change). Keep track of these page equivalencies so you can skip them next time, or determine whether there is a well-known mapping between the given sites so that you don't have to crawl them. (A dedup-by-hash sketch follows this list.)
  2. SPAM. Tons of people out there make tons of pages that are just pass-throughs to Google, and they "seed" themselves all over the web to get themselves crawled.
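
A rough sketch of the duplicate-detection idea from point 1: hash a normalized copy of the page body and skip anything you've already seen. The normalization here is a deliberately crude assumption; real crawlers strip boilerplate, session IDs and timestamps much more carefully.

    import hashlib

    seen_hashes = {}   # content fingerprint -> first URL seen with that content

    def content_fingerprint(html):
        # Crude normalization: lowercase and collapse whitespace before hashing.
        normalized = " ".join(html.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_duplicate(url, html):
        h = content_fingerprint(html)
        if h in seen_hashes:
            return True    # same content already crawled under another URL
        seen_hashes[h] = url
        return False

At billions of pages the in-memory dict would be replaced by an on-disk store, and exact hashing is usually supplemented with near-duplicate techniques such as shingling or simhash.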

So - you're always in a cycle of crawling. Always. You'll almost certainly be running on several (many, many, many) machines to ensure you can comply with politeness but still keep the data fresh.

If you want to press the fast-forward button and just get to processing pages with your own unique algorithm, you could likely tap into a pre-built crawler if you need it quickly - think "80 legs", as highlighted in Programmable Web. They do it using client-side computing power.

80 legs uses machine cycles from kids playing games on web sites. Think of a background process on a web page that calls out and does work while you're using that page/site, without you knowing it, because they are using the Plura technology stack.

“Plura Processing has developed a new and innovative technology for distributed computing. Our patent-pending technology can be embedded in any webpage. Visitors to these webpages become nodes and perform very small computations for the application running on our distributed computing network.” - Plura Demo Page

So they issue the "crawl" through thousands of nodes at thousands of IPs, staying polite to sites and crawling fast as a result. Now, I personally don't know that I care for that style of using the end user's browser unless it were called out VERY clearly on all of the sites using their technology - but it's an out-of-the-box approach if nothing else.

There are other crawlers, written as community-driven projects, that you could likely use as well.

As pointed out by other respondents - do the math. You'll need roughly 2,300 pages crawled per second to keep up with crawling 1B pages every 5 days. If you're willing to wait longer, the number goes down; if you're thinking you're going to crawl more than 1B, the number goes up. Simple math (worked out below).
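
For what it's worth, the arithmetic behind that ~2,300 figure:

    # 1 billion pages, refreshed every 5 days
    pages = 1_000_000_000
    seconds = 5 * 24 * 3600      # 432,000 seconds in 5 days
    print(pages / seconds)       # ~2315 pages per second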

Good luck!

answered Sep 23 '22 by Dave Quick

Large scale spidering (a billion pages) is a difficult problem. Here are some of the issues:

  • Network bandwidth. Assuming that each page is 10 KB, you are talking about a total of 10 terabytes to be fetched.

  • Network latency / slow servers / congestion mean that you are not going to achieve anything like the theoretical bandwidth of your network connection. Multi-threading your crawler only helps so much.

  • I assume that you need to store the information you have extracted from the billions of pages.

  • Your HTML parser needs to deal with web pages that are broken in all sorts of strange ways.

  • To avoid getting stuck in loops, you need to detect that you've "done this page already" (see the seen-URL sketch after this list).

  • Pages change so you need to revisit them.

  • You need to deal with 'robots.txt' and other conventions that govern the behavior of (well-behaved) crawlers.
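
As a small illustration of the "done this page already" point, here is a hedged sketch of URL canonicalization plus a seen-set. The canonicalization rules are illustrative assumptions; real crawlers also strip tracking parameters, default ports, and so on, and at a billion URLs the in-memory set would become a Bloom filter or an on-disk structure.

    from urllib.parse import urlsplit, urlunsplit

    seen_urls = set()

    def canonical(url):
        parts = urlsplit(url)
        # Lowercase scheme/host, drop the fragment, trim a trailing slash.
        path = parts.path.rstrip("/") or "/"
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           path, parts.query, ""))

    def should_visit(url):
        key = canonical(url)
        if key in seen_urls:
            return False   # already crawled (or queued) this page
        seen_urls.add(key)
        return True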

answered Sep 26 '22 by Stephen C