Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you Index Files for Fast Searches?

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.

What I want to know is how do they do this? Can you describe the algorithm?

like image 948
Unknown Avatar asked May 09 '09 23:05

Unknown


2 Answers

The simple case is an inverted index.

The most basic algorithm is simply:

  • scan the file for words, creating a list of unique words
  • normalize and filter the words
  • place an entry tying that word to the file in your index

The details are where things get tricky, but the fundamentals are the same.

By "normalize and filter" the words, I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a etc.), possibly "stemming" (removing common suffixes for verbs and plurals and such).

After that, you've got a unique list of words for the file and you can build your index off of that.

There are optimizations for reducing storage, techniques for checking locality of words (is "this" near "that" in the document, for example).

But, that's the fundamental way it's done.

like image 143
Will Hartung Avatar answered Oct 19 '22 03:10

Will Hartung


Here's a really basic description; for more details, you can read this textbook (free online): http://informationretrieval.org/¹

1). For all files, create an index. The index consists of all unique words that occur in your dataset (called a "corpus"). With each word, a list of document ids is associated; each document id refers to a document that contains the word.

Variations: sometimes when you generate the index you want to ignore stop words ("a", "the", etc). You have to be careful, though ("to be or not to be" is a real query composed of stopwords).

Sometimes you also stem the words. This has more impact on search quality in non-English languages that use suffixes and prefixes to a greater extent.

2) When a user enters a query, look up the corresponding lists, and merge them. If it's a strict boolean query, the process is pretty straightforward -- for AND, a docid has to occur in all the word lists, for OR, in at least one wordlist, etc.

3) If you want to rank your results, there are a number of ways to do that, but the basic idea is to use the frequency with which a word occurs in a document, as compared to the frequency you expect it to occur in any document in the corpus, as a signal that the document is more or less relevant. See textbook.

4) You can also store word positions to infer phrases, etc.

Most of that is irrelevant for desktop search, as you are more interested in recall (all documents that include the term) than ranking.


¹ previously on http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, accessible via wayback machine

like image 27
SquareCog Avatar answered Oct 19 '22 01:10

SquareCog