Consider there are 10 billion words that people have searched for in google. Corresponding to each word you have the sorted list of all document id's. The list looks like this:
[Word 1]->[doc_i1,doc_j1,.....]
[Word 2]->[doc_i2,doc_j2,.....]
...
...
...
[Word N]->[doc_in,doc_jn,.....]
I am looking for an algorithm to find 100 rare word-pairs. A rare word-pair is a pair of words that occur together(not necessarily contiguous) in exactly 1 document.
I am looking for something better than O(n^2) if possible.
An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap like data structure that directs you from a word to a document or a web page.
Querying a word in an inverted index consists in first locating the word in the vocabulary and getting the position of its inverted list. This operation can be performed in O(1) by using a hash algorithm. The inverted list of the word is then accessed in order to provide the search results.
The performance of the thing depends heavily on how far you have to go to find 100 such pairs, the idea is that you are done after processing only a small fraction of the total data set. To take advantage of the fact that you only process a small fraction of the data, you should employ in (1.) a sort algorithm that allows you to find the smallest elements long before the entire set has been sorted, like quick sort. Then the sorting can be done in like O(N*log(N1) ), with N1 being the number of words you actually need to add to the inverted index before finding 100 pairs. The complexity of the other operations, namely adding a word to the inverted index and checking if a word pair occurs in more than one document also is linear with the number of documents per word, so those operations should be speedy at the beginning and slow down later, because later you have more documents per word.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With