I know of efficient ways to look for one string in a file (KMP), or for several strings in a file (a trie).
But for years now I've been wondering whether there is a way (and occasionally thinking it impossible) to search multiple files for multiple strings.
Say I have a million files, and I want to answer queries like "find the files that contain the strings 'banana', 'motorboat' and 'the white fox'". What would be an efficient algorithm? Is there one?
Of course, it is possible to do such a search in time linear in the total size of the files to be searched. But that seems infeasible for a large number of big files. The existence of Google seems to indicate that there actually is a very fast algorithm for this, maybe even one where each query depends only on the query size and not on the size of the text database (of course, such an algorithm would involve some pre-processing of the input files).
I think there must be such an algorithm (Google does it!), but my searches turned up nothing.
Grep is a powerful utility available by default on UNIX-based systems. The name stands for Global Regular Expression Print. The grep command lets you customize how the tool searches for a pattern, or, in this case, multiple patterns. You can grep for multiple strings across different files and directories.
Grep Multiple Patterns

To search for multiple patterns, use the OR (alternation) operator. The alternation operator | (pipe) allows you to specify different possible matches, which can be literal strings or expression sets. This operator has the lowest precedence of all regular expression operators.
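For example, to match any of the three strings from the question in a single (hypothetical) file, you could run something like the following; -E enables extended regular expressions so that | works as alternation:

grep -E 'banana|motorboat|the white fox' notes.txt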
To search multiple files with the grep command, insert the filenames you want to search, separated by a space character. The terminal prints the name of every file that contains the matching lines, along with the actual lines that include the required string of characters. You can append as many filenames as needed.
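For instance, with some hypothetical filenames:

grep -E 'banana|motorboat' file1.txt file2.txt file3.txt

Adding the -l flag would print only the names of the matching files, without the matching lines themselves.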
To also show the lines before your matches, you can add -B to your grep. The -B 4 option tells grep to also show the 4 lines before the match. Alternatively, to show the lines that come after the match, use the -A parameter. In the example below, -A 2 tells grep to also show the 2 lines after the match.
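For example, against a hypothetical log file:

grep -B 4 'banana' app.log
grep -A 2 'banana' app.log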
At large scale, this is definitely a task for parallel programming: distribute the files to different computing units, let them search, then gather the results. This is actually what Google does; for example, they once solved a translation problem by combining a thousand commodity PCs. (Although they might be using other hardware for the real Google search results.) You can read popular articles about this on the internet.
Google, for example, invented a paradigm called MapReduce, which they described in a whitepaper. It basically boils down to mapping input to output in a first, widely distributed step, and then reducing all the little results into one main result in a second step.
One could implement the search like this (a small sketch follows the steps below):
(This is practically the same as the "distributed grep" problem they presented in their paper.)
The problem of finding out whether a given string occurs in a given text is well studied under the name "string matching"; see for example the Rabin-Karp algorithm or the Knuth-Morris-Pratt algorithm (just to get your hands on something). So implementing map is fairly easy.
For distributing the files one can use many different techniques. To get a proper view of what is possible with distributed filesystems, one could read about the Google File System (GFS), e.g. in the corresponding whitepaper.
reduce pretty much does nothing, so this is really easy.
Finished.
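To make this concrete, here is a minimal single-process sketch of the two functions in Python. The in-memory files, their names, and the naive substring check are all made-up stand-ins; a real deployment would run map in parallel over files stored in a distributed filesystem and would likely use a proper string-matching algorithm.

def map_phase(filename, text, patterns):
    # map: emit the filename if the file's text contains every pattern.
    # A naive substring check stands in for KMP / Rabin-Karp here.
    if all(pattern in text for pattern in patterns):
        yield filename

def reduce_phase(mapped_results):
    # reduce: merely collect all emitted filenames into one list.
    return [name for partial in mapped_results for name in partial]

# Hypothetical in-memory "files" for illustration.
files = {
    "a.txt": "the white fox sailed a motorboat and ate a banana",
    "b.txt": "nothing relevant in here",
}
patterns = ["banana", "motorboat", "the white fox"]
print(reduce_phase(map_phase(name, text, patterns) for name, text in files.items()))
# -> ['a.txt']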
That is the biggest advantage of the MapReduce paradigm: once you understand how map and reduce combine into one result, it is fairly easy to implement those two functions. If a MapReduce framework is already in place, you do not have to worry at all about the parallelism of the computation, which can otherwise cause severe headaches.
This is definitely not the only possible concept.
If you are interested in this field you will find lots of other possibilities, and I am sure many more will come up in the near future as distributed systems become more widespread. I hope I could provide some insight into what is possible, what to watch out for, and even a direction for how one could implement this right away.
(The question is worded fairly broadly. Any efficient solution depends heavily on the specific assumptions made. For the sake of discussion, I will make some assumptions you didn't explicitly mention.)
Let's say there are:

- f files,
- w words in total in those files,
- d unique words (d is the minimal size of a dictionary required to cover all files),
- q words in the query, and
- r files in the result set of the query.

I'll assume that q << d << f << w (i.e. each variable is orders of magnitude smaller than its successor), and further that q is essentially constant, i.e. O(1). I'll also assume that you're primarily interested in minimizing the amortized computation time measured in O(f) or O(w), that you're willing to invest more memory for less computation time, and that you expect to get queries fairly often.

Note that the runtime of the algorithm cannot be better than O(r), since we need to output each file belonging to the result set.
Create an index based on a hashmap from words to sets of files as follows:
from collections import defaultdict

# Inverted index: each word maps to the set of files containing it.
index = defaultdict(set)
for file in files:
    for word in file:  # assuming that iterating a file yields its words
        index[word].add(file)
This code runs in O(w), which is minimal (since you need to look at the complete input at least once). To find all files that contain all of the words in query, use:
def find_files(index, query):
    # Pick the query word that matches the fewest files, so the running
    # intersection starts as small as possible.
    wordWithLeastFilesMatching = min(query, key=lambda word: len(index[word]))
    result = set(index[wordWithLeastFilesMatching])
    for word in query:
        result = result.intersection(index[word])
    return result
This code's runtime is essentially determined by the q set intersections it needs to perform. In a typical case, each of the sets is probably O(log(f)) large and the overlap of the individual sets is moderate. In this case, the computation takes O(log(f)).

In the worst case, however, each of the sets is O(f) large, even though the overlap (and therefore r) is small. In this case, the computation would still take O(f).
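Putting it together, a tiny made-up example (the index contents and filenames are purely illustrative) would look like this:

# A toy index, as the build step above would produce it.
index = {
    "banana": {"a.txt", "b.txt"},
    "motorboat": {"a.txt"},
    "fox": {"a.txt", "c.txt"},
}
print(find_files(index, ["banana", "motorboat", "fox"]))  # {'a.txt'}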