How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including ideas such as the "inverted index", the "vector space model", "cosine similarity", "PageRank", etc.

However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" on Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!

How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?

Thanks!
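
One piece of the intuition (a generic observation, not a claim about Google's internals): a results page only needs the best ten or so documents, and selecting the top k out of N scored items takes O(N log k) with a bounded heap rather than a full O(N log N) sort. Below is a minimal Python sketch using made-up scores to show the shape of that selection.

    import heapq
    import random

    # Toy data standing in for per-document relevance scores; in a real engine
    # these would come from the index, not be generated on the fly.
    doc_scores = [(doc_id, random.random()) for doc_id in range(1_000_000)]

    # heapq.nlargest keeps only a 10-element heap while scanning, so it never
    # sorts all one million items; only the 10 winners are fully ordered.
    top_10 = heapq.nlargest(10, doc_scores, key=lambda pair: pair[1])

    for doc_id, score in top_10:
        print(doc_id, round(score, 4))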

UPDATE:

  1. Most of the responses (including some older discussions) so far seem to attribute the credit to the "inverted index". However, as far as I know, an inverted index only helps find the "relevant pages". In other words, via the inverted index Google could obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, based on the threads I have read so far, it is still not clear how those millions of "relevant pages" are actually "ranked" (see the sketch after this list).
  2. The MapReduce framework is unlikely to be the key component of real-time ranking. MapReduce is designed for batch tasks: when you submit a job to a MapReduce framework, the response time is usually at least a minute, which is clearly too slow for this requirement.
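
To make point 1 concrete, here is a toy sketch of the idea most discussions gesture at: the inverted index stores more than document IDs. Each posting can carry a precomputed query-dependent weight, and each document carries a precomputed query-independent score (e.g. PageRank), so query-time ranking mostly reduces to lookups and cheap additions. All structures, weights, and the scoring formula below are illustrative assumptions, not Google's actual design.

    from collections import defaultdict

    # Hypothetical index built offline (e.g. by a batch system such as MapReduce):
    # each term maps to a posting list of (doc_id, term_weight) pairs, and every
    # document also has a precomputed query-independent score such as PageRank.
    inverted_index = {
        "barack": [(1, 0.8), (2, 0.5), (3, 0.9)],
        "obama":  [(1, 0.7), (3, 0.6), (4, 0.4)],
    }
    static_score = {1: 0.9, 2: 0.2, 3: 0.6, 4: 0.1}

    def rank(query_terms, k=3):
        """Combine precomputed query-dependent and query-independent scores."""
        scores = defaultdict(float)
        for term in query_terms:
            for doc_id, weight in inverted_index.get(term, []):
                scores[doc_id] += weight              # query-dependent part
        for doc_id in scores:
            scores[doc_id] += static_score[doc_id]    # query-independent part
        # A real engine would use a top-k heap (as in the earlier sketch)
        # instead of sorting every candidate.
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

    print(rank(["barack", "obama"]))

The expensive batch work (crawling, PageRank computation, index construction) is paid offline, which is why MapReduce-style systems can still be part of the story without sitting on the query path.
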
asked Oct 03 '13 by user1036719



1 Answer

The question would really only be relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.

Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided it was included later.

This gives the developers a degree of latitude entirely unavailable in almost all other domains.

The real question to ask is: how precisely do the results match the actual rank assigned to each page?
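
To illustrate this point with a generic early-termination technique (not a statement about Google's implementation): if posting lists are ordered by a precomputed static score and the query-dependent part of the score has a known upper bound, the engine can stop scanning as soon as no remaining document could displace the current top k. Everything beyond that point is never ranked exactly, so the reported ordering is only approximate. All names and numbers below are made up for illustration.

    import heapq

    MAX_QUERY_BONUS = 1.0  # assumed upper bound on the query-dependent score

    def top_k_approximate(postings, query_score, k):
        """postings: (doc_id, static_score) pairs sorted by static_score, descending."""
        heap = []  # min-heap of (total_score, doc_id), size <= k
        for doc_id, static in postings:
            if len(heap) == k and static + MAX_QUERY_BONUS <= heap[0][0]:
                break  # no remaining document can beat the current k-th best
            total = static + query_score(doc_id)
            if len(heap) < k:
                heapq.heappush(heap, (total, doc_id))
            else:
                heapq.heappushpop(heap, (total, doc_id))
        return sorted(heap, reverse=True)

    # Toy data: static scores descending; query bonus concentrated on a few docs.
    postings = [(3, 5.0), (8, 4.5), (1, 1.2), (6, 1.1), (2, 1.0), (9, 0.9)]
    bonus = {3: 0.2, 8: 0.9, 1: 0.3}
    print(top_k_approximate(postings, lambda d: bonus.get(d, 0.0), k=2))

Techniques in this family (impact-ordered indexes, MaxScore/WAND-style pruning) are one reason "ranking" hundreds of millions of candidates is feasible: most candidates are never scored at all.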

answered Sep 28 '22 by Pekka