Lately I have been reading about web crawling, indexing, and serving. I found some information in Google Webmaster Tools - Google Basics about the process Google uses to crawl the web and serve search results. What I am wondering is: how do they store all those indexes? I mean, that's a lot to store, right? How do they do it?
Thanks
Crawling: Google downloads text, images, and videos from pages it found on the internet with automated programs called crawlers. Indexing: Google analyzes the text, images, and video files on the page, and stores the information in the Google index, which is a large database.
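To make the crawling step above concrete, here is a minimal sketch of a breadth-first crawler. It is not Google's actual crawler; the `PAGES` dict is a hypothetical in-memory "web" standing in for real HTTP fetches, so the example stays self-contained, but the loop (fetch a page, store its content, queue its out-links) is the basic shape of the process.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML, standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> hello world',
    "http://example.com/a": '<a href="http://example.com/">home</a> more text',
}

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, the way a crawler discovers new URLs."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed):
    """Breadth-first crawl: fetch a page, record it, then queue its out-links."""
    seen, frontier, store = set(), [seed], {}
    while frontier:
        url = frontier.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        html = PAGES[url]   # a real crawler would issue an HTTP GET here
        store[url] = html   # "downloaded" content is handed off to indexing
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(parser.links)
    return store

fetched = crawl("http://example.com/")
```

The `store` returned here is what the indexing step then analyzes.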
Content that's accessed every second ends up being stored in RAM or on SSDs. This represents only a small fraction of Google's entire index. The bulk of the index is stored on hard drives because, in Gary Illyes' words, hard drives are cheap, accessible, and easy to replace.
Like most search engines, Google indexes documents by building a data structure known as an inverted index. Such an index maps each query word to the list of documents that contain it. The index is very large due to the number of documents stored on the servers, so it is partitioned by document ID into many pieces called shards.
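A toy version of that structure can be sketched as follows. This is only an illustration, not Google's implementation: the corpus, the shard count, and the hash-partitioning scheme are all assumptions made to keep the example small, but it shows the two ideas from the paragraph above, an inverted index (word → document list) and partitioning by document ID into shards.

```python
# Toy corpus: document ID -> text. Real indexes cover billions of documents.
DOCS = {
    0: "google crawls the web",
    1: "the index maps words to documents",
    2: "web search uses an inverted index",
}

NUM_SHARDS = 2  # illustrative; Google partitions across many machines

def build_shards(docs, num_shards):
    """Build one inverted index per shard, partitioning by document ID."""
    shards = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]  # simple hash-partition on doc ID
        for word in set(text.split()):
            shard.setdefault(word, []).append(doc_id)
    return shards

def search(shards, word):
    """Query every shard and merge the posting lists (scatter-gather)."""
    return sorted(doc_id for shard in shards for doc_id in shard.get(word, []))

shards = build_shards(DOCS, NUM_SHARDS)
# search(shards, "index") returns [1, 2]: the documents containing "index"
```

Because each shard holds only a slice of the documents, a query is answered by asking every shard in parallel and merging the results, which is how the index scales across many machines.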
It takes between 4 days and 4 weeks for a brand-new website to be crawled and indexed by Google. This range, however, is fairly broad and has been challenged by people who claim to have had sites indexed in less than 4 days.
I'm answering my own question because I found some interesting material about the Google index:
This helped me understand it better, and I hope it helps you too!