I'm developing something quite similar to an IDE that will handle tens of thousands of very large (text) files and I'm surveying what the state of the art in the subject is.
As an example, Intellij's searching algorithm for standard (non-regex) expressions is pretty much immediate. How do they accomplish this? Are they just keeping some sort of suffix-tree of all the searchable files in memory? Are they just keeping a good portion of the file's contents in memory so they just do a standard KMP almost fully in-memory to avoid any disk IO?
Thanks
This algorithm, which Bob Boyer and I invented in about 1975, is the basis of the fastest known ways to find one string of characters in another.
Results: The Boyer-Moore-Horspool algorithm achieves the best overall results when used with medical texts. This algorithm usually performs at least twice as fast as the other algorithms tested. Conclusion: The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used.
The Boyer–Moore string-search algorithm has been the standard benchmark for the practical string-search literature.
Currently, IntelliJ IDEA indexes files in the project, and remembers which 3-grams (sequences of 3 letters or digits) occur in which files. When searching, it splits the query into 3-grams as well, gets the files from the index that contain all those trigrams, intersects those sets and uses a relatively straightforward text search in each of those files to check if they really contain the whole search string.
As js441 pointed out Apache Lucene is a good option but only if you are going to do term based search, similar to how google works. If you need to search arbitrary strings that span the terms Lucene will not help you.
In the later case you are right, you have to build some sort of suffix tree. A neat trick you can do after you have built a suffix tree is to write it to the file and mmap it into memory space. This way you will not waste memory to keep entire tree in RAM, but you will have frequently accessed portions of the tree automatically cached. The drawback to mmap is that initial searches might be somewhat slow. Also this will not if your files change often.
To help the case of searching just edited files, you can keep two indices, one for the bulk of your files and another one just for the recently edited files. So when you do the search you will search in both indices. Periodically you should rebuild the permanent index with the contents of the new files and replace the old one.
Here are some examples of when Lucene is good and when suffix tree is good:
Assume you have a document that contains the following:
A quick brown dog has jumped over lazy fox.
Lucene is good for the following searches:
q* b
With some tricks you can make the following searches work well:
'*ick *own'
This type of search will run very slow
'q*ick brown d*g'
And this type of search will never find anything
"ick brown d"
Lucene is also good when you treat your documents as bags of words. So you can easily do searches like this
quick fox
Which will find you all documents that have words quick and fox no matter what is in the middle.
On the other hand suffix trees work well with search for exact matches of substrings within the document, even in cases when your search is spans the terms and starts and ends in the middle of the term.
Very good algorithm for constructing suffix trees of large arrays is described here (Warnign paywalled).
You could take a look at Apache Lucene. It's a text search engine library written entirely in java. It may be a little bit too heavy for your use, but since it's open source, you could take a look at how it works.
It features a demo which leads you to build an index and search through the library source code, which sounds pretty much exactly like what you want to do.
Also, take a look at the Boyer-Moore string search algorithm. This is apparently commonly used in applications which offer a ctrl+f style document search. It involves pre-processing the search term so it can run as few comparisons as possible.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With