 

Searching for multiple strings in multiple files

I have a text file containing 21,000 strings (one per line) and 500 MB of other text files (mainly source code). For each string I need to determine whether it is contained in any of those files. I wrote a program that does the job, but its performance is terrible (it would take a couple of days; I need the job done in 5-6 hours max).
I'm writing in C# with Visual Studio 2010.

I have a couple of questions regarding my problem:
a) Which approach is better?

foreach(string s in StringsToSearch)
{
    //scan all files and break when string is found
}

or

foreach(string f in Files)
{
    //search that file for each string that is not already found
}

b) Is it better to scan one file line by line

StreamReader r = new StreamReader(file);
while(!r.EndOfStream)
{
    string s = r.ReadLine();
    //... if(s.Contains(xxx));
}

or

StreamReader r = new StreamReader(file);
string s = r.ReadToEnd();
//if(s.Contains(xxx));

c) Would threading improve performance and how to do that?
d) Is there any software that can do that so I don't have to write my own code?

Ichibann asked Oct 21 '10 12:10


People also ask

How do I search for a string in multiple files?

To search multiple files with the grep command, insert the filenames you want to search, separated with a space character. The terminal prints the name of every file that contains the matching lines, and the actual lines that include the required string of characters. You can append as many filenames as needed.
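A minimal sketch of this (the filenames and the string "TODO" are illustrative, not from the question):

```shell
# two throwaway files, only the first contains the search string
printf 'x\nTODO: fix\n' > a.txt
printf 'clean\n' > b.txt

# with more than one filename, grep prefixes each match with its file
grep "TODO" a.txt b.txt
# prints: a.txt:TODO: fix
```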

Can you grep all files in a directory?

You can make grep search in all the files and all the subdirectories of the current directory using the -r recursive search option: grep -r search_term .
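For example (the directory layout here is made up for illustration):

```shell
# a small tree with one source file
mkdir -p proj/src
printf 'int main() { return 0; }\n' > proj/src/main.c

# -r descends into every subdirectory of the given path
grep -r "main" proj
# prints: proj/src/main.c:int main() { return 0; }
```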


2 Answers

If you just want to know whether each string is found or not, and don't need to do any further processing, then I'd suggest you simply use grep. grep is extremely fast and designed for exactly this kind of problem.

grep -f strings-file other-files...

should do the trick. I'm sure there is a Windows implementation out there somewhere. At worst, Cygwin will have it.
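A small self-contained sketch of how that invocation behaves (filenames are placeholders; `-F` treats the patterns as fixed strings rather than regexes, which fits a list of literal strings, and `-l` just lists the files that match):

```shell
# strings.txt holds one literal search string per line
printf 'alpha\nbeta\n' > strings.txt
printf 'the beta release\n' > code1.txt
printf 'nothing here\n' > code2.txt

# -f reads the patterns from strings.txt; only code1.txt contains one
grep -F -l -f strings.txt code1.txt code2.txt
# prints: code1.txt
```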

EDIT: This answers question d)

Cameron Skinner answered Sep 29 '22 08:09


You want to minimize file I/O, so your first idea is very bad: you would be opening each of the 'other' files up to 21,000 times. You want to use something based on the second one (iterate over the files). And when those other files aren't overly big, load each into memory once with ReadAllText.

List<string> keys = ...;    // load all search strings

foreach (string f in Files)
{
    // read the whole file once, then test every string not already found
    string text = System.IO.File.ReadAllText(f);  // easy version of ReadToEnd

    // brute force
    foreach (string key in keys)
    {
        if (text.IndexOf(key) >= 0) ....
    }
}

The brute-force part can be improved upon, but I think you will find it acceptable.

Henk Holterman answered Sep 29 '22 08:09