Fast substring search algorithm to be used by a sort of IDE with tens of thousands of very big files

I'm developing something quite similar to an IDE that will handle tens of thousands of very large (text) files and I'm surveying what the state of the art in the subject is.

As an example, IntelliJ's search for standard (non-regex) expressions is pretty much immediate. How do they accomplish this? Are they keeping some sort of suffix tree of all the searchable files in memory? Or are they keeping a good portion of the files' contents in memory, so that they can run a standard KMP almost fully in memory and avoid any disk IO?

Thanks

asked Sep 04 '16 20:09

devoured elysium

3 Answers

Currently, IntelliJ IDEA indexes files in the project, and remembers which 3-grams (sequences of 3 letters or digits) occur in which files. When searching, it splits the query into 3-grams as well, gets the files from the index that contain all those trigrams, intersects those sets and uses a relatively straightforward text search in each of those files to check if they really contain the whole search string.
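
The scheme described above can be sketched roughly as follows. This is only an illustration, not IntelliJ's actual code: the class name TrigramIndex and its methods are hypothetical, every 3-character window is indexed (rather than only letters and digits), and file contents are kept in memory where a real tool would re-read candidate files from disk.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory trigram index; names are illustrative only. */
final class TrigramIndex {
    // trigram -> names of the files that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();
    // file name -> full text (assumption: a real tool would read the file instead)
    private final Map<String, String> contents = new HashMap<>();

    void addFile(String name, String text) {
        contents.put(name, text);
        for (int i = 0; i + 3 <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + 3), k -> new HashSet<>()).add(name);
        }
    }

    /** Returns the sorted names of the files that contain the query as a substring. */
    Set<String> search(String query) {
        Set<String> candidates = null;
        // Intersect the posting sets of all trigrams in the query.
        for (int i = 0; i + 3 <= query.length(); i++) {
            Set<String> files = postings.getOrDefault(query.substring(i, i + 3), Set.of());
            if (candidates == null) {
                candidates = new HashSet<>(files);
            } else {
                candidates.retainAll(files);
            }
            if (candidates.isEmpty()) {
                break;
            }
        }
        // A query shorter than 3 characters has no trigram, so every file is a candidate.
        if (candidates == null) {
            candidates = contents.keySet();
        }
        // Verify each candidate with a plain text search, since trigrams alone can give false positives.
        Set<String> hits = new TreeSet<>();
        for (String name : candidates) {
            if (contents.get(name).contains(query)) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

The final verification pass matters: a file can contain every trigram of the query without containing the query itself, so the index only narrows down which files need to be scanned.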

answered Oct 25 '22 19:10

Peter Gromov

As js441 pointed out, Apache Lucene is a good option, but only if you are going to do term-based search, similar to how Google works. If you need to search for arbitrary strings that span terms, Lucene will not help you.

In the latter case you are right: you have to build some sort of suffix tree. A neat trick you can do after you have built a suffix tree is to write it to a file and mmap it into your address space. This way you will not waste memory keeping the entire tree in RAM, but the frequently accessed portions of the tree will be cached automatically. The drawback of mmap is that the initial searches might be somewhat slow. Also, this will not work well if your files change often.
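
In Java, the mmap trick is available through FileChannel.map. The sketch below only shows the mechanics, under loud assumptions: the "index" is a flat array of ints rather than a real serialized suffix tree, and the class MappedIndex and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: stores index entries in a file and reads them through a memory mapping. */
final class MappedIndex {
    /** Writes the entries as 4-byte big-endian ints. */
    static void write(Path file, int[] entries) {
        ByteBuffer buf = ByteBuffer.allocate(entries.length * Integer.BYTES);
        buf.asIntBuffer().put(entries);
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the file read-only; the operating system loads and caches pages on demand. */
    static IntBuffer map(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes the entries to a temporary file and reads them back via the mapping. */
    static int[] roundTrip(int[] entries) {
        try {
            Path file = Files.createTempFile("index", ".bin");
            file.toFile().deleteOnExit();
            write(file, entries);
            IntBuffer mapped = map(file);
            int[] out = new int[mapped.remaining()];
            mapped.get(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real index would be read in place through the buffer (for example with mapped.get(i) during a binary search) instead of being copied out, so that only the touched pages ever reach RAM.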

To help the case of searching just edited files, you can keep two indices, one for the bulk of your files and another one just for the recently edited files. So when you do the search you will search in both indices. Periodically you should rebuild the permanent index with the contents of the new files and replace the old one.
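
A rough sketch of the two-index arrangement follows. Everything here is an assumption made for illustration: the class TwoTierIndex is hypothetical, plain maps with a naive scan stand in for the real indices, and the sketch additionally assumes that a recently edited file should hide its stale copy in the bulk index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical two-tier index: a large, rarely rebuilt bulk tier plus a small tier for recent edits. */
final class TwoTierIndex {
    // Stand-ins for real indices: file name -> text.
    private Map<String, String> bulk = new HashMap<>();
    private final Map<String, String> recent = new HashMap<>();

    /** Records an edit; only the small tier is touched. */
    void update(String name, String text) {
        recent.put(name, text);
    }

    /** Searches both tiers; a recently edited file masks its stale copy in the bulk tier. */
    Set<String> search(String query) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> e : bulk.entrySet()) {
            if (!recent.containsKey(e.getKey()) && e.getValue().contains(query)) {
                hits.add(e.getKey());
            }
        }
        for (Map.Entry<String, String> e : recent.entrySet()) {
            if (e.getValue().contains(query)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    /** Periodic rebuild: fold the recent edits into a new bulk tier and swap it in. */
    void rebuild() {
        Map<String, String> merged = new HashMap<>(bulk);
        merged.putAll(recent);
        bulk = merged;
        recent.clear();
    }

    int recentSize() {
        return recent.size();
    }
}
```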

Here are some examples of when Lucene is good and when suffix tree is good:

Assume you have a document that contains the following:

A quick brown dog has jumped over lazy fox.

Lucene is good for the following searches:

quick
quick brown
q*
q* b

With some tricks you can make the following searches work well:
'*ick *own'

This type of search will run very slowly:
'q*ick brown d*g'

And this type of search will never find anything:
"ick brown d"

Lucene is also good when you treat your documents as bags of words, so you can easily run searches like this:
quick fox

This will find all documents that contain the words quick and fox, no matter what lies between them.
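
To see why a term-based index behaves this way, here is a toy inverted index. It is not Lucene's API; the class TermIndex is a hypothetical model that only supports whole-word AND queries, which is enough to show why "quick fox" matches while "ick brown d" cannot.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index over whole words; a simplified model, not Lucene. */
final class TermIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the documents that contain every whitespace-separated query term as a whole word. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            Set<Integer> docs = postings.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

With the example document indexed, a query for "quick fox" matches because both words are keys in the index, while "ick brown d" matches nothing because neither "ick" nor "d" was ever stored as a term.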

On the other hand, suffix trees work well for exact substring matches within a document, even when the search string spans several terms and starts and ends in the middle of a term.
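
As a small illustration of this kind of search, the sketch below uses a suffix array, a simpler and more compact relative of the suffix tree, rather than a suffix tree itself. The class SuffixArraySearch is a hypothetical name, and the construction is a naive sort that would be far too slow for very large files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Naive suffix array over one text; a simplified stand-in for a suffix tree. */
final class SuffixArraySearch {
    private final String text;
    private final Integer[] suffixes; // start offsets of all suffixes, in lexicographic order

    SuffixArraySearch(String text) {
        this.text = text;
        suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Plain comparison sort: fine for a demo, not for very large inputs.
        Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** Compares the pattern with the suffix at 'start', truncated to the pattern's length. */
    private int compare(String pattern, int start) {
        int end = Math.min(text.length(), start + pattern.length());
        return pattern.compareTo(text.substring(start, end));
    }

    /** Returns the sorted offsets at which the pattern occurs. */
    List<Integer> findAll(String pattern) {
        // Binary search for the first suffix that is not smaller than the pattern.
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(pattern, suffixes[mid]) > 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // All suffixes that start with the pattern are adjacent in the sorted order.
        List<Integer> hits = new ArrayList<>();
        for (int i = lo; i < suffixes.length && compare(pattern, suffixes[i]) == 0; i++) {
            hits.add(suffixes[i]);
        }
        Collections.sort(hits);
        return hits;
    }
}
```

Built over the example sentence, findAll("ick brown d") returns the single offset 4, the position of the "i" in "quick", which is exactly the query a term index cannot answer.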

A very good algorithm for constructing suffix trees over large arrays is described here (warning: paywalled).

answered Oct 25 '22 19:10

Vlad

You could take a look at Apache Lucene. It's a text search engine library written entirely in Java. It may be a little too heavy for your use case, but since it's open source, you can look at how it works.

It features a demo which leads you to build an index and search through the library source code, which sounds pretty much exactly like what you want to do.

Also, take a look at the Boyer-Moore string search algorithm. It is apparently commonly used in applications that offer a Ctrl+F style document search. It pre-processes the search term so that it can run as few comparisons as possible.
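
To give a feel for that pre-processing, here is a sketch of the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift table and is not the full algorithm. The class name Horspool is an arbitrary choice for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Boyer-Moore-Horspool: a simplified Boyer-Moore variant using only a bad-character table. */
final class Horspool {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }
        // Pre-processing: for each character of the pattern except the last, record how far
        // the window may shift when that character is aligned with the window's last position.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }
        int pos = 0;
        while (pos + m <= n) {
            // Compare the window with the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Characters that do not occur in the pattern allow a shift by the full pattern length.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }
}
```

The longer the search term and the larger the alphabet, the more often the window jumps by the full pattern length, which is why this family of algorithms can skip most of the text.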

answered Oct 25 '22 18:10

js441

Tags:

java

algorithm

intellij-idea

string-algorithm