Searching in text files for a keyword until a string is encountered

I'm writing a program to help me search for a keyword inside thousands of files. Each of these files has unnecessary lines that I need to ignore because they distort the results. Luckily, in every file they all come after a specific line.
What I have so far is a search that does not yet ignore the lines after that specific line. It returns an enumerable of the names of the files containing the keyword.

var searchResults = files.Where(file => File.ReadLines(file.FullName)
                                            .Any(line => line.Contains(keyWord)))
                                            .Select(file => file.FullName);

Is there a simple and fast way to implement this? It doesn't have to be LINQ, as I'm not even sure that would be possible.

Edit:
An example to make it clearer. This is how the text files are structured:
xxx
xxx
string
yyy
yyy

I want to search the xxx lines until either the keyword or the string is found, and then skip to the next file. I want to ignore the yyy lines in my search.

drouning asked Jul 30 '15 08:07


3 Answers

Try this:

var searchResults = files.Where(file => File.ReadLines(file.FullName)
                                            .TakeWhile(line => line != "STOP")
                                            .Any(line => line.Contains(keyWord)))
                                            .Select(file => file.FullName);
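
For reference, a self-contained sketch of the same idea follows. It assumes the marker line is passed in as a parameter rather than hard-coded as "STOP", and it wraps the query in a hypothetical helper, `KeywordSearch.FilesContaining`, which is not part of the original answer. `TakeWhile` stops enumerating at the marker, so the lines after it are never searched.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class KeywordSearch
{
    // Hypothetical helper: returns the full names of the files that contain
    // keyWord on some line before the first line equal to stopLine.
    public static IEnumerable<string> FilesContaining(
        IEnumerable<FileInfo> files, string keyWord, string stopLine)
    {
        return files.Where(file => File.ReadLines(file.FullName)
                                       // Stop enumerating at the marker line.
                                       .TakeWhile(line => line != stopLine)
                                       // Stop at the first line with the keyword.
                                       .Any(line => line.Contains(keyWord)))
                    .Select(file => file.FullName);
    }
}
```

It could be called as `KeywordSearch.FilesContaining(new DirectoryInfo(path).EnumerateFiles("*.txt"), "needle", "STOP")`, where `path`, the file pattern, and the keyword are placeholders.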
Ghasan غسان answered Oct 22 '22 10:10


You can process the files in parallel: just add AsParallel() after "files". This may improve processing speed, depending on your storage. ReadLines reads a file lazily instead of loading the whole file before the search, so it should work as you expect.
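
As a rough sketch of that suggestion applied to the query from the question: the only change is the `AsParallel()` call. The `ParallelKeywordSearch` class is a hypothetical wrapper, the stop line is not handled here, and any speed-up is an assumption worth measuring on real data.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ParallelKeywordSearch
{
    // Hypothetical helper: same query as in the question, run through PLINQ.
    // The order of the results is not guaranteed without AsOrdered().
    public static List<string> FilesContaining(IEnumerable<FileInfo> files, string keyWord)
    {
        return files.AsParallel()
                    .Where(file => File.ReadLines(file.FullName)
                                       .Any(line => line.Contains(keyWord)))
                    .Select(file => file.FullName)
                    .ToList();
    }
}
```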

EDIT: Sorry, I misread your question the first time and didn't notice the stop word. Given that, I think it is easier to avoid LINQ:

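        // Requires: using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;
        var result = new ConcurrentBag<string>();
        // Parallel.ForEach runs the body concurrently. A plain foreach over
        // files.AsParallel() would still run the body on a single thread.
        Parallel.ForEach(files, file =>
        {
            foreach (string line in File.ReadLines(file.FullName))
            {
                if (line.Contains(keyWord))
                {
                    result.Add(file.FullName);
                    break;
                }
                if (line.Contains(stopWord))
                {
                    break; // everything after the stop line is ignored
                }
            }
        });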
sarh answered Oct 22 '22 09:10


It's only a minor modification: read the lines only up to the marker string, and stop at the first line that contains the keyword:

var searchResults = files.Where(file => File.ReadLines(file.FullName)
                                            .TakeWhile(line => line != myString)
                                            .Any(line => line.IndexOf(keyWord) > -1)
                               )
                         .Select(file => file.FullName);
Gert Arnold answered Oct 22 '22 11:10