Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for search in inverted index

Consider there are 10 billion words that people have searched for in google. Corresponding to each word you have the sorted list of all document id's. The list looks like this:

[Word 1]->[doc_i1,doc_j1,.....]
[Word 2]->[doc_i2,doc_j2,.....]
...
...
...
[Word N]->[doc_in,doc_jn,.....]

I am looking for an algorithm to find 100 rare word-pairs. A rare word-pair is a pair of words that occur together(not necessarily contiguous) in exactly 1 document.

I am looking for something better than O(n^2) if possible.

like image 304
funkyme Avatar asked Feb 05 '14 16:02

funkyme


People also ask

What is an inverted search index?

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap like data structure that directs you from a word to a document or a web page.

Which searching technique is used in implementation of inverted files?

Querying a word in an inverted index consists in first locating the word in the vocabulary and getting the position of its inverted list. This operation can be performed in O(1) by using a hash algorithm. The inverted list of the word is then accessed in order to provide the search results.


1 Answers

  1. You order the words according to the number of documents they occur in. The idea here is, that words that occur rarely at all, will occur rarely in pairs as well. If you find words that occur in exactly one document, just pick any other word from that document and you are done.
  2. Then you start inverting the index, starting with the rarest word. That means you create a map where each document points to the set of words in it. At first you create that inverted index with the rarest word only. After you inserted all documents associated with that rarest word into the inverted index, you have a map where each document points to exactly one word.
  3. Then you add the next word with all its documents, still following the order created in (1.). At some point you will find that a document associated with a word is already present in your inverted map. Here you check all words associated with that document if they form such a rare word pair.

The performance of the thing depends heavily on how far you have to go to find 100 such pairs, the idea is that you are done after processing only a small fraction of the total data set. To take advantage of the fact that you only process a small fraction of the data, you should employ in (1.) a sort algorithm that allows you to find the smallest elements long before the entire set has been sorted, like quick sort. Then the sorting can be done in like O(N*log(N1) ), with N1 being the number of words you actually need to add to the inverted index before finding 100 pairs. The complexity of the other operations, namely adding a word to the inverted index and checking if a word pair occurs in more than one document also is linear with the number of documents per word, so those operations should be speedy at the beginning and slow down later, because later you have more documents per word.

like image 108
pentadecagon Avatar answered Sep 27 '22 23:09

pentadecagon