Is it possible to crawl billions of pages on a single server?
A whole website, or certain pages on it, can remain unseen by Google for a simple reason: its crawlers are not allowed to visit them. There are several bot directives that will prevent a page from being crawled. Note that it's not a mistake to have these directives in robots.txt.
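If you want to check programmatically whether a crawler is allowed to fetch a given page, Python's standard library can parse robots.txt for you. A minimal sketch, assuming a placeholder site and user-agent name:

```python
from urllib import robotparser

# Parse a site's robots.txt and check whether a given user-agent may fetch a URL.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")
```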
Assuming a website is crawlable and indexable, Google usually takes anywhere from 3-4 days to a couple of weeks to pick it up. Submitting a sitemap file in Google Search Console usually helps make the process faster, but that doesn't guarantee the site will be crawled.
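If you don't already have a sitemap, generating a minimal one is straightforward. The sketch below writes a basic sitemap.xml following the sitemaps.org protocol; the URLs are placeholders for your own pages.

```python
from datetime import date

# Hypothetical list of pages to advertise to crawlers.
urls = ["https://example.com/", "https://example.com/about", "https://example.com/blog"]

entries = "\n".join(
    f"  <url>\n    <loc>{u}</loc>\n    <lastmod>{date.today().isoformat()}</lastmod>\n  </url>"
    for u in urls
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

with open("sitemap.xml", "w") as f:
    f.write(sitemap)
```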
Not if you want the data to be up to date.
Even a small player in the search game numbers its crawled pages in the multiple billions.
" In 2006, Google has indexed over 25 billion web pages,[32] 400 million queries per day,[32] 1.3 billion images, and over one billion Usenet messages. " - Wikipedia
And remember, that quote cites numbers from 2006. That is ancient history; the state of the art is well beyond that now.
You have to balance several things:
- Freshness of content
- Politeness of your crawler
- Reducing the work you need to do
So - you're always in a cycle of crawling. Always. You'll almost certainly be running on several (many, many, many) machines to ensure you can comply with politeness but still rock out on the freshness of your data.
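To make the politeness/freshness trade-off concrete, here is a minimal single-process sketch of a polite re-crawl loop: it rate-limits requests per host and revisits URLs on a fixed interval. Everything in it (the seed URLs, the one-second per-host delay, the 24-hour re-crawl interval) is an assumption for illustration, not a production design.

```python
import time
import urllib.request
from urllib.parse import urlparse

CRAWL_DELAY = 1.0             # assumed per-host politeness delay, in seconds
RECRAWL_INTERVAL = 24 * 3600  # assumed freshness target: revisit every 24 hours

seeds = ["https://example.com/", "https://example.org/"]  # placeholder seeds
last_fetch_per_host = {}  # host -> time of last request (politeness)
last_seen_per_url = {}    # url  -> time of last crawl (freshness)

def fetch(url):
    """Fetch one page, waiting if we hit the same host too soon."""
    host = urlparse(url).netloc
    wait = CRAWL_DELAY - (time.time() - last_fetch_per_host.get(host, 0))
    if wait > 0:
        time.sleep(wait)
    last_fetch_per_host[host] = time.time()
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

while True:  # "you're always in a cycle of crawling"
    for url in seeds:
        if time.time() - last_seen_per_url.get(url, 0) < RECRAWL_INTERVAL:
            continue  # still fresh enough, skip for now
        try:
            body = fetch(url)
            last_seen_per_url[url] = time.time()
            # ... parse, extract links, hand off to your own processing here ...
        except Exception as exc:
            print(f"failed to fetch {url}: {exc}")
    time.sleep(60)  # idle briefly before the next pass
```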
If you want to press the fast-forward button and just get to processing pages with your own unique algorithm, you could likely tap into a pre-built crawler if you need something quickly - think "80 legs", as highlighted in Programmable Web. They do it using client-side computing power.
80 legs uses machine cycles from kids playing games on web sites. Think of a background process on a web page that calls out and does work while you're using that page/site, without you knowing it, because the site is using the Plura technology stack.
“Plura Processing has developed a new and innovative technology for distributed computing. Our patent-pending technology can be embedded in any webpage. Visitors to these webpages become nodes and perform very small computations for the application running on our distributed computing network.” - Plura Demo Page
So they are issuing the "crawl" through thousands of nodes at thousands of IPs, staying polite to sites while still crawling fast. Now, I personally don't care for that style of using the end user's browser unless it's called out VERY clearly on every site using their technology - but it's an out-of-the-box approach if nothing else.
There are also community-driven, open-source crawler projects that you could likely use.
As other respondents have pointed out - do the math. You'll need roughly 2,300 pages crawled per second to keep up with crawling 1B pages every 5 days. If you're willing to wait longer, the number goes down; if you plan to crawl more than 1B pages, it goes up. Simple math.
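The arithmetic is easy to sanity-check; a few lines of Python reproduce the ~2,300 pages/second figure and the corresponding bandwidth, assuming 1B pages and an average page size of 10 KB (the same rough numbers used elsewhere in this thread):

```python
PAGES = 1_000_000_000  # 1B pages
WINDOW_DAYS = 5        # refresh the whole set every 5 days
AVG_PAGE_KB = 10       # assumed average page size

seconds = WINDOW_DAYS * 24 * 3600
pages_per_second = PAGES / seconds
bandwidth_mbps = pages_per_second * AVG_PAGE_KB * 8 / 1000  # megabits per second

print(f"{pages_per_second:,.0f} pages/second")    # ~2,315 pages/second
print(f"{bandwidth_mbps:,.0f} Mbit/s sustained")  # ~185 Mbit/s at 10 KB/page
```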
Good luck!
Large-scale spidering (a billion pages) is a difficult problem. Here are some of the issues:
Network bandwidth. Assuming each page is 10 KB, you are talking about a total of roughly 10 terabytes to fetch.
Network latency / slow servers / congestion mean that you are not going to achieve anything like the theoretical bandwidth of your network connection. Multi-threading your crawler only helps so much.
I assume that you need to store the information you have extracted from the billions of pages.
Your HTML parser needs to deal with web pages that are broken in all sorts of strange ways.
To avoid getting stuck in loops, you need to detect that you've "done this page already" (see the sketch after this list).
Pages change, so you need to revisit them.
You need to deal with 'robots.txt' and other conventions that govern the behavior of (well-behaved) crawlers.
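On the "done this page already" point: at billions of URLs you can't afford to keep every raw URL around, so crawlers typically normalize each URL and store a compact fingerprint of it, often alongside a hash of the page body to catch duplicate content served under different URLs. Below is a minimal in-memory sketch of that idea; real systems would use disk-backed stores or Bloom filters rather than Python sets, and the URL normalization here is deliberately simplistic.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

seen_urls = set()     # fingerprints of URLs we've already scheduled/crawled
seen_content = set()  # fingerprints of page bodies, to catch duplicate content

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different forms map to the same key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # drop the fragment

def already_done(url: str, body: bytes) -> bool:
    """Return True if this URL or an identical page body was seen before."""
    url_fp = hashlib.sha1(normalize(url).encode()).digest()
    body_fp = hashlib.sha1(body).digest()
    if url_fp in seen_urls or body_fp in seen_content:
        return True
    seen_urls.add(url_fp)
    seen_content.add(body_fp)
    return False

# Example: the second, trivially different URL is recognized as already done.
print(already_done("https://Example.com/a#top", b"<html>hello</html>"))  # False
print(already_done("https://example.com/a",     b"<html>hello</html>"))  # True
```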