I don't know what the normal Windows search uses, but there is a technique in which you index the files once and then use the index later for faster searching (e.g. Windows Search 4.0).
Is there any other way of searching that is faster than this? Can you elaborate from an implementation point of view? (Assuming that I may need to implement it.)
To make it simple to understand, let me put it this way:
Assume that I want to build a search application which performs search operations similar to the one we use in Windows.
My question is: what are the possible options/ways/approaches available to build such an application (and which of them are faster than the existing ones)?
(Can a binary-search-tree kind of technique be used?)
Searching is an operation or technique that finds the position of a given element or value in a list; a search is said to be successful or unsuccessful depending on whether the element being searched for is found or not. The most basic technique is sequential (linear) search: all the elements of the collection are traversed one by one to check whether the element is present. A minimal sketch is shown below.
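As a baseline for the index-based approaches discussed further down, here is a minimal sketch of sequential search applied to file contents; the function name and directory argument are just illustrative:

```python
import os

def sequential_search(root, query):
    """Naively scan every file under `root` for `query`, with no index.

    Cost is proportional to the total number of bytes scanned per query,
    which is exactly why the index-based approaches below exist.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as f:
                    if query in f.read():
                        matches.append(path)
            except OSError:
                pass  # unreadable file; skip it
    return matches
```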
There are basically two techniques used for full-text search over big corpuses: posting lists and suffix arrays.
A posting list is a list of (term, document_id) pairs, optionally with a position in the document. If you sort it or hash it by term, you have an efficiently-searchable full-text index.
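To make that concrete, here is a minimal in-memory posting-list (inverted index) sketch, assuming a trivial lowercase/whitespace tokenizer; real indexers add stemming, compression, and on-disk storage:

```python
from collections import defaultdict

def build_index(docs):
    """Posting list: term -> list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def search(index, term):
    """Return the ids of the documents that contain `term`."""
    return sorted({doc_id for doc_id, _pos in index.get(term.lower(), [])})

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "lazy foxes sleep"}
index = build_index(docs)
print(search(index, "lazy"))   # [2, 3]
```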
There are various techniques for making posting lists smaller, faster to access, faster to update, and more flexible, some at the cost of accuracy. Lucene is probably the best off-the-shelf posting-list-based text indexer available today, and (contrary to your earlier comment) it can index text found in PDF, Microsoft Word, etc., files. The Lucene.net project linked by Thomas Maierhofer looks like a pretty reasonable port, although of course you'll always be a bit behind the leading edge of what's going on in the Java version.
For a corpus that's much bigger than memory, you pretty much have to store the posting list on disk. That militates against using a simple binary search tree to access it: if you have a hundred thousand documents of ten thousand words each, you have a billion postings, which means your binary search tree has a minimum depth of about 30. The trouble with that is that the 30 nodes on the path from the root of the tree to the leaf will, in general, be located in different parts of your disk, so the disk has to seek 30 times to find the postings for one term! At a typical 8-10 ms per seek, that's roughly a quarter of a second per term, which is prohibitively slow.
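For reference, the back-of-envelope arithmetic behind those numbers (the 8 ms seek time is an assumption for a conventional spinning disk, not a measurement):

```python
import math

docs = 100_000
words_per_doc = 10_000
postings = docs * words_per_doc             # 1,000,000,000 postings
depth = math.ceil(math.log2(postings))      # ~30 levels in a balanced binary tree
seek_ms = 8                                 # rough random-seek time for a spinning disk
print(depth, depth * seek_ms / 1000)        # 30 levels, ~0.24 s of seeking per term
```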
However, there is a modified version of the binary tree data structure called a “B-tree” which can work. Lucene uses a simple data structure that's a lot like a B-tree, but supports massive updates much more easily. I wrote a very simple version of this same data structure in my own dumbfts project, which implements a full-text search engine for my email in a few pages of Python. I use it every day, it's free software, and it works pretty well for what I use it for, but it's not exactly a world-class search system like Lucene.
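This is not how Lucene or dumbfts is actually laid out, but a minimal sketch of the underlying idea: keep the postings sorted by term in one on-disk file and hold only a sparse index of every Nth term in memory, so a lookup costs one seek plus a short sequential read instead of ~30 seeks:

```python
import bisect
import json

BLOCK = 1024  # index every 1024th term; a tuning assumption, not a magic number

def write_postings(path, postings):
    """Write `postings` (term -> list of doc ids) sorted by term, one JSON
    line per term. Returns a sparse in-memory index of (term, byte offset)."""
    sparse = []
    with open(path, "wb") as f:
        for i, term in enumerate(sorted(postings)):
            if i % BLOCK == 0:
                sparse.append((term, f.tell()))
            f.write((json.dumps([term, postings[term]]) + "\n").encode())
    return sparse

def lookup(path, sparse, term):
    """One bisect in memory to find the right block, then scan that block."""
    keys = [t for t, _ in sparse]
    i = bisect.bisect_right(keys, term) - 1
    if i < 0:
        return []
    with open(path, "rb") as f:
        f.seek(sparse[i][1])
        for _ in range(BLOCK):
            line = f.readline()
            if not line:
                break
            t, doc_ids = json.loads(line)
            if t == term:
                return doc_ids
            if t > term:
                break
    return []
```

Real term dictionaries add block compression and merge-based updates on top of this, but the "one or two disk reads per term" property is the point.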
As an example of making posting lists smaller at the cost of accuracy, the Managing Gigabytes book (and the mg4j project) has a data structure called a "signed minimal perfect hash table", which doesn't actually store the terms that were indexed — just hashes of them. So there's a small probability of a false positive — you have to retrieve the documents that supposedly contain the term in order to confirm that they really do.
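A rough illustration of that space/accuracy trade-off, using an ordinary truncated hash rather than a real signed minimal perfect hash table, purely to show the false-positive check:

```python
import hashlib

def signature(term):
    """Store only a 32-bit hash ('signature') of each term, never the term itself."""
    return int.from_bytes(hashlib.sha1(term.encode()).digest()[:4], "big")

docs = {1: "quick brown fox", 2: "lazy dog"}

# Compact index: signature -> doc ids (the terms themselves are not stored).
compact_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        compact_index.setdefault(signature(term), []).append(doc_id)

def search(term):
    candidates = compact_index.get(signature(term), [])
    # A different term could collide with the same signature, so verify
    # against the actual documents to weed out the (rare) false positives.
    return [d for d in candidates if term in docs[d].split()]

print(search("lazy"))  # [2]
```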
Suffix arrays, which are a much more compact and slightly slower version of radix trees (aka tries), are implemented by GLIMPSE and a few other programs, but they've basically fallen out of use these days. They have some flexibility not present in the posting-list data structure — they allow regular expression searches and searches with misspellings, for example, but they're not quite as fast. There has been some recent work with the Burrows-Wheeler Transform, which is based on suffix arrays, providing a compression algorithm in which the compressed file is the full-text index! The best-documented version of this is called the FM-index, although I've heard that there are older versions of the technique, maybe unpublished. Unlike the other techniques described above, though, I think this doesn't actually work when the documents are PDF files or something like that — you could still use the same approach of extracting a text version of each page and indexing that, but you don't get the benefit of compressing the original document.
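For completeness, a tiny suffix-array sketch; the construction below is deliberately naive (roughly O(n² log n)), whereas real implementations use linear-time construction and compressed forms such as the FM-index:

```python
import bisect

def build_suffix_array(text):
    """Indices of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Start offsets of every occurrence of `pattern` in `text`."""
    suffixes = [text[i:] for i in sa]  # materialized only for clarity;
                                       # a real implementation compares in place
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])

text = "abracadabra"
sa = build_suffix_array(text)
print(find(text, sa, "abra"))  # [0, 7]
```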
My acquaintance Tim wrote a really good introductory series of blog postings on search back in 2003, which are still pretty great. They cover this stuff (except for the recent developments) in a lot more depth.
Ravi: Is this the kind of information you're looking for?
Edit: thanks for fixing my formatting, Martin!
Take a look at Lucene. It is a super-fast search library for text files. There is also Lucene.NET available. If you want to implement search yourself, it is a good starting point and a benchmark for your implementation.