Which is the fastest search technique/method? (in the context of file searching)

I don't know what normal Windows searching uses, but there is a technique in which you index the files once up front and then use that index later for faster searching (e.g. Windows Search 4.0).

Is there any way to search faster than this? Can you elaborate from an implementation point of view? (Assume that I may need to implement it myself.)

To make it simple to understand, let me put it this way:

Assume that I want to build a search application which performs search similarly to the one built into Windows.

My question is: what are the possible options/approaches available for building such an application, and which of them are faster than the existing ones?

(Can a binary-search-tree kind of technique be used?)

Pale Blue Dot asked Sep 12 '09




2 Answers

There are basically two techniques used for full-text search over big corpuses: posting lists and suffix arrays.

A posting list is a list of (term, document_id) pairs, optionally with a position in the document. If you sort it or hash it by term, you have an efficiently-searchable full-text index.
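To make the posting-list idea concrete, here is a minimal in-memory sketch (my own toy code, not any particular engine's format; `build_index` and `search` are names I made up):

    from collections import defaultdict

    def build_index(docs):
        """Map each term to its postings: (doc_id, position) pairs."""
        index = defaultdict(list)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term].append((doc_id, pos))
        return index

    def search(index, term):
        """Return the ids of documents containing `term`."""
        return sorted({doc_id for doc_id, _ in index.get(term.lower(), [])})

    docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick quick fox"}
    index = build_index(docs)
    print(search(index, "quick"))  # [1, 3]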

There are various techniques for making posting lists smaller, faster to access, faster to update, and more flexible, some at the cost of accuracy. Lucene is probably the best off-the-shelf posting-list-based text indexer available today, and (contrary to your earlier comment) it can index text found in PDF, Microsoft Word, etc., files. The Lucene.net project linked by Thomas Maierhofer looks like a pretty reasonable port, although of course you'll always be a bit behind the leading edge of what's going on in the Java version.

For a corpus that's much bigger than memory, you pretty much have to store the posting list on disk. That militates against using a simple binary search tree to access it: if you have a hundred thousand documents of ten thousand words each, you have a billion postings, which means your binary search tree has a minimum depth of 30. The trouble with that is that the 30 nodes on the path from the root of the tree to the leaf will, in general, be located in different parts of your disk, so the disk has to seek 30 times to find the postings for one term! At a typical 8 ms per seek, that's about a quarter of a second per lookup, which is prohibitively slow.
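The arithmetic behind those numbers, with an assumed 8 ms average seek time (a typical figure for a spinning disk):

    import math

    postings = 100_000 * 10_000             # documents * words per document
    depth = math.ceil(math.log2(postings))  # minimum binary-tree depth: 30
    seek = 0.008                            # ~8 ms per seek (assumed)
    print(depth * seek)                     # ~0.24 s to look up one term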

However, there is a modified version of the binary tree data structure called a “B-tree” which can work. Lucene uses a simple data structure that's a lot like a B-tree, but supports massive updates much more easily. I wrote a very simple version of this same data structure in my own dumbfts project, which implements a full-text search engine for my email in a few pages of Python. I use it every day, it's free software, and it works pretty well for what I use it for, but it's not exactly a world-class search system like Lucene.
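Here is a toy version of that idea (my own sketch; it is not Lucene's or dumbfts's actual format): the postings live in a flat file of tab-separated term/postings lines sorted by term, and a small in-memory "sparse index" remembers every 1000th term with its byte offset, so a lookup costs one seek plus a short sequential scan rather than ~30 scattered seeks:

    import bisect

    BLOCK = 1000  # sample every 1000th term into the in-memory index

    def build_sparse_index(path):
        """Record the byte offset of every BLOCK-th line of a sorted index file."""
        keys, offsets = [], []
        with open(path, "rb") as f:
            lineno = 0
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                if lineno % BLOCK == 0:
                    keys.append(line.split(b"\t", 1)[0])
                    offsets.append(offset)
                lineno += 1
        return keys, offsets

    def lookup(path, keys, offsets, term):
        """One seek to the right block, then a short sequential scan."""
        term = term.encode()
        i = bisect.bisect_right(keys, term) - 1
        if i < 0:
            return None  # term sorts before the first key in the file
        with open(path, "rb") as f:
            f.seek(offsets[i])  # the single disk seek
            for line in iter(f.readline, b""):
                key, _, postings = line.partition(b"\t")
                if key == term:
                    return postings.rstrip(b"\n").decode()
                if key > term:
                    break  # past where the term would be; not present
        return None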

As an example of making posting lists smaller at the cost of accuracy, the Managing Gigabytes book (and the mg4j project) has a data structure called a "signed minimal perfect hash table", which doesn't actually store the terms that were indexed — just hashes of them. So there's a small probability of a false positive — you have to retrieve the documents that supposedly contain the term in order to confirm that they really do.
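A toy illustration of that trade-off (just the hash-only idea, not the book's actual signed minimal perfect hash construction): store a truncated hash of each term rather than the term itself, and re-check candidate documents to filter out false positives:

    import hashlib
    from collections import defaultdict

    def term_hash(term):
        # Keep only 3 bytes of the digest: a big space saving, but a small
        # (nonzero) chance that two different terms share a hash.
        return hashlib.sha1(term.encode()).digest()[:3]

    def build_hashed_index(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term_hash(term)].add(doc_id)
        return index

    def search(index, docs, term):
        term = term.lower()
        candidates = index.get(term_hash(term), set())
        # Verification pass: a colliding term may point at documents that
        # never contained the query, so re-check each candidate document.
        return sorted(d for d in candidates if term in docs[d].lower().split())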

Suffix arrays, which are a much more compact and slightly slower version of radix trees (aka tries), are implemented by GLIMPSE and a few other programs, but they've basically fallen out of use these days. They have some flexibility not present in the posting-list data structure: they allow regular-expression searches and searches with misspellings, for example, but they're not quite as fast.

There has been some recent work with the Burrows-Wheeler Transform, which is based on suffix arrays, providing a compression algorithm in which the compressed file is the full-text index! The best-documented version of this is called the FM-index, although I've heard that there are older versions of the technique, maybe unpublished. Unlike the other techniques described above, though, I think this doesn't actually work when the documents are PDF files or something like that; you could still use the same approach of extracting a text version of each page and indexing that, but you don't get the benefit of compressing the original document.
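For concreteness, here is a bare-bones suffix array over a single string, with a binary search for substring matches (the naive construction below sorts full suffixes and is far too slow for big corpora, but it shows the idea):

    def build_suffix_array(text):
        """Indices of all suffixes of `text`, sorted by the suffix they start."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find(text, sa, pattern):
        """All positions where `pattern` occurs, via two binary searches."""
        m = len(pattern)

        def bound(strict):
            # Leftmost suffix whose m-char prefix is >= pattern
            # (or > pattern when strict is True).
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                prefix = text[sa[mid]:sa[mid] + m]
                if prefix < pattern or (strict and prefix == pattern):
                    lo = mid + 1
                else:
                    hi = mid
            return lo

        return sorted(sa[bound(False):bound(True)])

    text = "banana"
    sa = build_suffix_array(text)
    print(find(text, sa, "an"))  # [1, 3]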

My acquaintance Tim wrote a really good introductory series of blog postings on search back in 2003, which are still pretty great. They cover this stuff (except for the recent developments) in a lot more depth.

Ravi: Is this the kind of information you're looking for?

Edit: thanks for fixing my formatting, Martin!

Kragen Javier Sitaker answered Oct 22 '22


Take a look at Lucene. It is a super-fast search library for text files. There is also Lucene.NET available. If you'd like to implement search yourself, it is a good starting point and a benchmark for your implementation.
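As a rough sketch of how little code basic Lucene usage takes, here is an outline using PyLucene, Lucene's Python binding (the Java and .NET APIs are analogous). Class names follow recent Lucene versions and may differ in yours, so treat this as a sketch rather than copy-paste code:

    import lucene
    from java.nio.file import Paths
    from org.apache.lucene.analysis.standard import StandardAnalyzer
    from org.apache.lucene.document import Document, Field, TextField
    from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
    from org.apache.lucene.queryparser.classic import QueryParser
    from org.apache.lucene.search import IndexSearcher
    from org.apache.lucene.store import FSDirectory

    lucene.initVM()
    directory = FSDirectory.open(Paths.get("index"))

    # Index a couple of documents.
    writer = IndexWriter(directory, IndexWriterConfig(StandardAnalyzer()))
    for name, text in [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]:
        doc = Document()
        doc.add(TextField("path", name, Field.Store.YES))
        doc.add(TextField("content", text, Field.Store.YES))
        writer.addDocument(doc)
    writer.close()

    # Search the index.
    searcher = IndexSearcher(DirectoryReader.open(directory))
    query = QueryParser("content", StandardAnalyzer()).parse("quick")
    for hit in searcher.search(query, 10).scoreDocs:
        print(searcher.doc(hit.doc).get("path"))  # a.txt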

Thomas Maierhofer answered Oct 22 '22