I know of efficient ways to look for one string in a file (KMP), or for several strings in a file (a trie).
But for years now I've been wondering whether there is a way (and occasionally thinking it impossible) to search multiple files for multiple strings.
Say I have a million files, and I want to answer queries like "find the files that contain the strings 'banana', 'motorboat' and 'the white fox'". What would be an efficient algorithm? Is there one?
Of course, it is possible to do such a search in time linear in the total size of the files to be searched. But that seems infeasible for a large number of big files. The existence of Google seems to indicate that there actually is a very fast algorithm for this, maybe even one where each query depends only on the query size and not on the size of the text database (of course, such an algorithm would involve some pre-processing of the input files).
I think there must be such an algorithm (Google does it!), but my searches turned up nothing.
Grep is a powerful utility available by default on UNIX-based systems. The name stands for Global Regular Expression Print. The grep command lets you customize how the tool searches for a pattern, or, in this case, multiple patterns. You can grep for multiple strings across different files and directories.
Grep Multiple Patterns

To search for multiple patterns, use the OR (alternation) operator. The alternation operator | (pipe) allows you to specify different possible matches, which can be literal strings or expression sets. This operator has the lowest precedence of all regular expression operators.
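For example, to match any of the three strings from the question in a single (hypothetical) file, you could run something like the following; -E enables extended regular expressions so that | works as alternation:

grep -E 'banana|motorboat|the white fox' notes.txt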
To search multiple files with the grep command, insert the filenames you want to search, separated by a space character. The terminal prints the name of every file that contains the matching lines, along with the actual lines that include the required string of characters. You can append as many filenames as needed.
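For instance, with some hypothetical filenames:

grep -E 'banana|motorboat' file1.txt file2.txt file3.txt

Adding the -l flag would print only the names of the matching files, without the matching lines themselves.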
To also show the lines before your matches, you can add -B to your grep. The -B 4 option tells grep to also show the 4 lines before the match. Alternatively, to show the lines that come after the match, use the -A parameter. In the example below, -A 2 tells grep to also show the 2 lines after the match.
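For example, against a hypothetical log file:

grep -B 4 'banana' app.log
grep -A 2 'banana' app.log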
At large scale, this is definitely a task for parallel programming: distribute the files to different computing units, let them search, then gather the results. This is actually what Google does; for example, they once solved a translation problem by combining a thousand commodity PCs. (Although they might be using other hardware for the real Google search results.) You can read popular articles about this on the internet.
Google, for example, invented a paradigm called MapReduce, which they described in a whitepaper. It basically boils down to mapping input to output in a first, widely distributed step, and then reducing all the little results into one main result in a second step.
One could implement the search like this (a small sketch follows the steps below):
(This is practically the same as the "distributed grep" problem they presented in their paper.)
The problem of finding out whether a given string occurs in a given text is well studied under the name "string matching"; see for example the Rabin-Karp algorithm or the Knuth-Morris-Pratt algorithm (just to get your hands on something). So implementing map is fairly easy.
For distributing the files one can use many different techniques. To get a proper view of what is possible with distributed filesystems, one could read about the Google File System (GFS), e.g. in the corresponding whitepaper.
reduce pretty much does nothing, so this is really easy.
Finished.
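To make this concrete, here is a minimal single-process sketch of the two functions in Python. The in-memory files, their names, and the naive substring check are all made-up stand-ins; a real deployment would run map in parallel over files stored in a distributed filesystem and would likely use a proper string-matching algorithm.

def map_phase(filename, text, patterns):
    # map: emit the filename if the file's text contains every pattern.
    # A naive substring check stands in for KMP / Rabin-Karp here.
    if all(pattern in text for pattern in patterns):
        yield filename

def reduce_phase(mapped_results):
    # reduce: merely collect all emitted filenames into one list.
    return [name for partial in mapped_results for name in partial]

# Hypothetical in-memory "files" for illustration.
files = {
    "a.txt": "the white fox sailed a motorboat and ate a banana",
    "b.txt": "nothing relevant in here",
}
patterns = ["banana", "motorboat", "the white fox"]
print(reduce_phase(map_phase(name, text, patterns) for name, text in files.items()))
# -> ['a.txt']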
That is the biggest advantage of the MapReduce paradigm: once you understand how map and reduce combine into one result, it is fairly easy to implement those two functions. If a MapReduce framework is already in place, you do not have to worry at all about the parallelism of the computation, which can otherwise cause severe headaches.
This is definitely not the only possible concept.
If you are interested in this field you will find lots of other possibilities, and I am sure many more will come up in the near future as distributed systems become more widespread. I hope I could provide some insight into what is possible, what to watch out for, and even a direction for how one could implement this right away.
(The question is worded fairly broadly. Any efficient solution depends heavily on the specific assumptions made. For the sake of discussion, I will make some assumptions you didn't explicitly mention.)
Let's say there are:

- f files,
- w words in total in those files,
- d unique words (d is the minimal size of a dictionary required to cover all files),
- q words in the query, and
- r files in the result set of the query.

I'll assume that q << d << f << w (i.e. each variable is orders of magnitude smaller than its successor), and further that q is essentially constant, i.e. O(1). I'll also assume that you're primarily interested in minimizing the amortized computation time measured in O(f) or O(w), that you're willing to invest more memory for less computation time, and that you expect to get queries fairly often.

Note that the runtime of the algorithm cannot be better than O(r), since we need to output each file belonging to the result set.
Create an index based on a hashmap from words to sets of files as follows:
from collections import defaultdict

# Inverted index: each word maps to the set of files containing it.
index = defaultdict(set)
for file in files:
    for word in file:  # assuming that iterating a file yields its words
        index[word].add(file)
This code runs in O(w), which is minimal (since you need to look at the complete input at least once). To find all files that contain all of the words in query, use:
def find_files(index, query):
    # Pick the query word that matches the fewest files, so the running
    # intersection starts as small as possible.
    wordWithLeastFilesMatching = min(query, key=lambda word: len(index[word]))
    result = set(index[wordWithLeastFilesMatching])
    for word in query:
        result = result.intersection(index[word])
    return result
This code's runtime is essentially determined by the q set intersections it needs to perform. In a typical case, each of the sets is probably O(log(f)) large and the overlap of the individual sets is moderate. In this case, the computation takes O(log(f)).

In the worst case, however, each of the sets is O(f) large, even though the overlap (and therefore r) is small. In this case, the computation would still take O(f).
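Putting it together, a tiny made-up example (the index contents and filenames are purely illustrative) would look like this:

# A toy index, as the build step above would produce it.
index = {
    "banana": {"a.txt", "b.txt"},
    "motorboat": {"a.txt"},
    "fox": {"a.txt", "c.txt"},
}
print(find_files(index, ["banana", "motorboat", "fox"]))  # {'a.txt'}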