
Storing Inverted Index

I know that inverted indexing is a good way to index words, but what I'm confused about is how search engines actually store the index. For example, if the word "google" appears in documents 2, 4, 6, and 8 with different frequencies, where should that be stored? Would a database table with a one-to-many relation be any good for storing it?

user3036757 asked Sep 18 '14 06:09




1 Answer

It is highly unlikely that full-fledged SQL-like databases are used for this purpose. First, it is called an inverted index because it is just an index: each entry is just a reference. Non-relational databases and key-value stores, which have come up as a favourite topic in relation to web technology, are a more natural fit.

  • You only ever have one way of accessing the data (by query word). That is why it's called an index.
  • Each entry is a list/array/vector of references to documents, so each element of that list is very small. The only other information you might store besides the document ID is a tf-idf score for each element.
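As a rough illustration of the two points above, here is a minimal in-memory sketch in Python. The function name `build_index`, the sample document texts, and the use of a raw term count in place of a real tf-idf score are all assumptions made for the example, not something a particular search engine is known to do.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to a list of (doc_id, term_count) pairs, ordered by doc_id.

    The raw term count is only a stand-in for a tf-idf score.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Count how often each word occurs in this document.
        counts = {}
        for word in docs[doc_id].lower().split():
            counts[word] = counts.get(word, 0) + 1
        # One small posting per (word, document) pair.
        for word, count in counts.items():
            index[word].append((doc_id, count))
    return dict(index)

# Hypothetical documents, numbered as in the question.
docs = {
    2: "google search google",
    4: "google maps",
    6: "try google",
    8: "google and altavista",
}
index = build_index(docs)
print(index["google"])  # [(2, 2), (4, 1), (6, 1), (8, 1)]
```

The only access path is `index[word]`, which is why a plain key-value store keyed by the word is enough to hold this structure.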

How to use it:

If you have a single query word ("google"), you look up in the inverted index which documents the word appears in (2, 4, 6, 8 in your example). If you have tf-idf scores, you can sort the results to report the best-matching document first. You then look up which documents the IDs 2, 4, 6, 8 refer to, and report their URL along with a snippet etc. URLs, snippets etc. are probably best stored in another table or key-value store.
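A single-word lookup along those lines might look like the sketch below. The scores, the URLs, and the names `postings`, `doc_store`, and `search_one` are invented for illustration; the point is only that the index and the document metadata live in two separate stores.

```python
# Hypothetical inverted index: word -> list of (doc_id, score) pairs.
postings = {
    "google": [(2, 0.10), (4, 0.40), (6, 0.25), (8, 0.05)],
}

# Hypothetical separate store: doc_id -> metadata shown in the results.
doc_store = {
    2: {"url": "http://example.com/2", "snippet": "..."},
    4: {"url": "http://example.com/4", "snippet": "..."},
    6: {"url": "http://example.com/6", "snippet": "..."},
    8: {"url": "http://example.com/8", "snippet": "..."},
}

def search_one(word, postings, doc_store):
    """Return (doc_id, url) pairs for one query word, best score first."""
    hits = sorted(postings.get(word, []), key=lambda p: p[1], reverse=True)
    return [(doc_id, doc_store[doc_id]["url"]) for doc_id, _score in hits]

for doc_id, url in search_one("google", postings, doc_store):
    print(doc_id, url)
```

With these made-up scores, document 4 is reported first and document 8 last.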

If you have multiple query words ("google" and "altavista"), you look up both words in the inverted index and get two lists of document IDs (2, 4, 6, 8 and 3, 7, 8, 11, 19). You take the intersection of the two lists, which in this case is (8): the list of documents in which both query words occur.
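One common way to compute that intersection, assuming both lists are kept sorted by document ID, is to walk them in step. The function name `intersect` is just a label for this sketch.

```python
def intersect(a, b):
    """Intersect two lists of document IDs that are sorted in ascending order."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # The document contains both words.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result

google = [2, 4, 6, 8]
altavista = [3, 7, 8, 11, 19]
print(intersect(google, altavista))  # [8]
```

Each list is read once from front to back, so the work grows with the combined length of the two lists rather than with their product.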

Unapiedra answered Sep 22 '22 05:09