search for multiple strings

I know of efficient ways to look for one string in a file (KMP), or for several strings in a file (a trie).

But, for years now, I've been wondering if there is a way (and occasionally thinking it impossible) to search multiple files for multiple strings.

Say I have a million files, and I want to answer queries like "find the files that contain the strings 'banana', 'motorboat' and 'the white fox'". What would be an efficient algorithm? Is there one?

Of course, it is possible to do such a search in time linear in the total size of the files. But that seems infeasible for a large number of large files. The existence of Google suggests that there actually is a very fast algorithm for this, maybe even one where each query depends only on the size of the query, not on the size of the text database (of course, such an algorithm would involve some pre-processing of the input files).

I think there must be such an algorithm (Google does it!), but my searches have turned up nothing.

asked Feb 11 '14 by josinalvo

People also ask

Can you grep multiple strings?

Grep is a powerful utility available by default on UNIX-based systems. The name stands for Global Regular Expression Print. By using the grep command, you can customize how the tool searches for a pattern or multiple patterns in this case. You can grep multiple strings in different files and directories.

How do you grep 2 patterns at a time?

To search for multiple patterns, use the OR (alternation) operator. The alternation operator | (pipe) allows you to specify different possible matches, which can be literal strings or expression sets. This operator has the lowest precedence of all regular expression operators.

Can grep search multiple files?

To search multiple files with the grep command, insert the filenames you want to search, separated with a space character. The terminal prints the name of every file that contains the matching lines, and the actual lines that include the required string of characters. You can append as many filenames as needed.

How do you grep with additional lines?

To also show you the lines before your matches, you can add -B to your grep. For example, -B 4 tells grep to also show the 4 lines before the match. Alternatively, to show the lines after a match, use the -A parameter; -A 2, for example, tells grep to also show the 2 lines after the match.


2 Answers

Parallel Programming

At large scale, this is definitely a task for parallel programming: distribute the files to different computing units, let them search, then gather the results. This is actually what Google does; for example, they once solved a translation problem by combining a thousand commodity PCs. (Although they might be using other hardware for the real Google search.) You can read popular articles about this on the internet.

"MapReduce" as one concept

Google, for example, invented a paradigm called MapReduce, which they described in a whitepaper. It basically boils down to mapping input to output (widely distributed) in a first step, then reducing all the little results into one main result in a second step.

One could implement the search like this:

  • map: Distribute the documents together with the keyword to search for. If the search word is found in the current file, return the filename from the computing node. Otherwise return nothing.
  • reduce: Gather all filenames in a list from all nodes.

(This is practically the same as the "distributed grep" problem they presented in their paper; a single-machine sketch of the idea follows below.)
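As a rough illustration, here is a minimal single-machine sketch of those two functions in Python. The names map_file and reduce_results, the example file names, and the whole-file read are my own assumptions, not part of the original answer; a real MapReduce job would run map_file on many nodes in parallel and let the framework collect the outputs for the reduce step.

# Single-machine sketch of the "distributed grep" style map/reduce described above.

def map_file(filename, keyword):
    # Map step: emit the filename if the keyword occurs in the file.
    with open(filename, encoding="utf-8") as f:
        return [filename] if keyword in f.read() else []

def reduce_results(partial_results):
    # Reduce step: gather all emitted filenames into one list.
    matches = []
    for part in partial_results:
        matches.extend(part)
    return matches

files = ["a.txt", "b.txt"]                            # hypothetical input files
partials = [map_file(name, "banana") for name in files]
print(reduce_results(partials))                       # filenames containing "banana"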

The problem of finding out whether a given string occurs in a given text is well studied under the name "string matching"; see for example the Rabin-Karp algorithm or the Knuth-Morris-Pratt algorithm (just to give you a starting point). So implementing map is fairly easy.
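As an illustration of that substring check, here is a minimal, generic Rabin-Karp sketch in Python. The base and modulus are arbitrary choices of mine, and in practice Python's built-in substring search would usually be enough; this is just to show the rolling-hash idea.

def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    # Return the index of the first occurrence of pattern in text, or -1.
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)          # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Compare characters only when the rolling hashes collide.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1

print(rabin_karp("the white fox", "white"))   # -> 4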

For distributing the files one can use many different techniques. If you want a proper view of what is possible with distributed filesystems, you could read up on the Google File System (GFS), e.g. in the corresponding whitepaper.

reduce pretty much does nothing, so this is really easy.

Finished.

That is the biggest advantage of the MapReduce paradigm: once you understand how map and reduce combine into one result, those two functions are fairly easy to implement. And if a MapReduce framework is already in place, you do not have to worry at all about the parallelism of the computation -- which can otherwise cause severe headaches.

Other concepts

This is definitely not the only possible concept.

  • It is possible to vary the hardware you use (independent PCs, as MapReduce assumes, or more of a supercomputer with dozens of CPUs).
  • It is possible to vary the distributed (or non-distributed) filesystem you use.
  • It is possible to vary the programming language, which can also make a huge difference.

If you are interested in this field of study you will find lots of other possibilities, and I am sure many more will come up in the near future, as distributed systems are on the rise more than ever. I hope I could provide some insight into what is possible, what to watch out for, and even a direction for how one could implement this right away.

answered Oct 04 '22 by Ulrich Thomas Gabor


(The question is worded fairly broadly. Any efficient solution depends heavily on the specific assumptions made. For the sake of discussion, I will make some assumptions you didn't explicitly mention.)

Model

Let's say...

  • There are f files,
  • w words in total in those files,
  • d unique words (d is the minimal size of a dictionary required to cover all files),
  • q words in the query, and
  • r files are in the result set of the query.

I'll assume that q << d << f << w (i.e. each variable is orders of magnitude smaller than its successor), and further that q is essentially constant, i.e. O(1). I'll also assume that you're primarily interested in minimizing the amortized computation time measured in O(f) or O(w), that you're willing to invest more memory for less computation time, and that you expect to get queries fairly often.

Note that the runtime of the algorithm cannot be better than O(r), since we need to output each file belonging to the result set.

Algorithm

Create an index based on a hashmap from words to sets of files as follows:

from collections import defaultdict

index = defaultdict(set)                      # word -> set of files containing it
for filename in files:
    with open(filename, encoding="utf-8") as f:
        for word in f.read().split():         # naive whitespace tokenization
            index[word].add(filename)

This code runs in O(w), which is minimal (since you need to look at the complete input at least once). To find all the files that contain every word of the query:

def find_files(query, index):
    # Start from the word that matches the fewest files, then intersect.
    rarest = min(query, key=lambda word: len(index[word]))
    result = set(index[rarest])
    for word in query:
        result &= index[word]
    return result

This code's runtime is essentially determined by the q set intersections it needs to perform. In a typical case, each of the sets is probably O(log(f)) large and the overlap of the individual sets is moderate. In this case, the computation takes O(log(f)).

In the worst case, however, each of the sets is O(f) large even though the overlap (and therefore r) is small. In that case the computation would still take O(f).
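As a small, hypothetical usage example (the query words and the printed result are illustrative only, and the index is assumed to have been built as sketched above):

query = ["banana", "motorboat", "fox"]
matching = find_files(query, index)   # files that contain all three query words
print(matching)                       # e.g. {'a.txt', 'c.txt'}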

answered Oct 04 '22 by blubb