 

C# Fastest string search in all files

Problem (Check edit for clarifications)

I have a list of about 1500 strings, and for each of them I have to check whether any of the files in a directory (and its subdirectories) contains that string (there are about 4000 files).

Code

What I have now are these two variants:

Original

foreach(var str in stringList)
{
    allFiles.Any(f => File.ReadAllText(f).Contains(str));
}

Second variant (using ReadLines instead of ReadAllText, as suggested by VladL in this question)

foreach(var str in stringList)
{
    allFiles.SelectMany(File.ReadLines).Any(line => line.Contains(str));
}

I only tested the complete program execution with the original variant, and it took 21 minutes to finish. I then timed a single statement (checking whether one string is contained in any file), searching for a string I knew was not present, to measure the worst-case scenario. These are my timings (each run 3 times):

Original: 1285, 1369, 1336 ms

Second variant: 2718, 2804, 2831 ms

I also tried replacing ReadAllText with ReadAllLines in the original statement (without changing anything else), but saw no performance difference.

Question

Is there any faster way to check whether a string is contained in any file (a large number of large files)?

Edit

I admit I didn't express myself as clearly as I wanted by saying I have a list of strings. What I actually have is a list of CSV files; I iterate through those, and then through each line of those files (ignoring the first line). From each line I compose a string out of some of its fields, and then check whether any file contains that string.

foreach(var csvFile in csvFiles)
{
    var lines = File.ReadAllLines(csvFile);
    foreach(var line in lines)
    {
        if (IsHeader(line)) continue;
        var str = ComposeString(line);
        var found = allFiles.Any(f => File.ReadAllText(f).Contains(str));
        // do stuff with the line and found
    }
}
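Note that, as written, every composed string triggers a fresh ReadAllText of every file, so each file may be read up to 1500 times. A minimal sketch of the inverted approach (FindContained is a hypothetical helper, not part of the code above): collect all composed strings first, then read each file once and record which strings it contains:

```csharp
using System.Collections.Generic;
using System.Linq;

static class Search
{
    // Scans one haystack once and returns every needle
    // that occurs in it as a substring.
    public static HashSet<string> FindContained(IEnumerable<string> needles, string haystack)
    {
        return new HashSet<string>(needles.Where(haystack.Contains));
    }
}
```

Each file is then read a single time: collect the composed strings into a list first, then call Search.FindContained(needles, File.ReadAllText(f)) per file and accumulate the results in a HashSet.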

Edit 2

public void ExecuteAhoCorasick()
{
    var table = CreateDataTable();
    var allFiles = GetAllFiles();
    var csvFiles = GetCsvFiles();
    var resList = new List<string>();

    foreach (var csvFile in csvFiles)
    {
        if (csvFile.Contains("ValueList_")) continue;
        var lines = File.ReadAllLines(csvFile);
        foreach (var line in lines)
        {
            if (line == HeaderLine) continue;
            var res = line.Split(';');
            if (res.Length <= 7) continue;
            var resPath = $"{res[0]}.{res[1]}.{res[2]}".Trim('.');
            resList.Add(resPath);

            var row = table.NewRow();
            row[0] = res[0]; // Group
            row[1] = res[1]; // Type
            row[2] = res[2]; // Key
            row[3] = res[3]; // Global
            row[4] = res[4]; // De
            row[5] = res[5]; // Fr
            row[6] = res[6]; // It
            row[7] = res[7]; // En
            row[8] = resPath; // Resource Path
            row[9] = false;
            row[10] = ""; // Comment
            row[11] = csvFile; // File Path

            table.Rows.Add(row);
        }
    }

    var foundRes = new List<string>();

    // Build the trie once, outside the loop; it only depends on resList.
    // Build() must be called before Find(), as in the answer below.
    var trie = new Trie();
    trie.Add(resList);
    trie.Build();

    foreach (var file in allFiles)
    {
        // var chars = File.ReadLines(file).SelectMany(line => line);
        var text = File.ReadAllText(file);
        foundRes.AddRange(trie.Find(text));
        // foundRes.AddRange(trie.Find(chars));
    }

    // update row[9] to true foreach res in foundRes
}
Century asked Sep 21 '17 08:09

2 Answers

I think the fastest way to do this will be to:

  1. Read each file completely into memory. This simplifies the code.
  2. Use the Aho-Corasick algorithm to search for the keywords in the text for each file.

An implementation of Aho-Corasick is available here.

I've written a simple program using that implementation from GitHub to test worst-case performance (that is, when none of the keywords are present in the text), comparing Aho-Corasick with Contains():

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using ConsoleApp1;

namespace Demo
{
    class Program
    {
        static void Main()
        {
            string[] needles = createNeedles(1500).ToArray();
            string haystack = createHaystack(100000);

            var sw = Stopwatch.StartNew();
            anyViaContains(needles, haystack);
            Console.WriteLine("anyViaContains() took " + sw.Elapsed);

            sw.Restart();
            anyViaAhoCorasick(needles, haystack);
            Console.WriteLine("anyViaAhoCorasick() took " + sw.Elapsed);
        }

        static bool anyViaContains(string[] needles, string haystack)
        {
            return needles.Any(haystack.Contains);
        }

        static bool anyViaAhoCorasick(string[] needles, string haystack)
        {
            var trie = new Trie();
            trie.Add(needles);
            trie.Build();
            return trie.Find(haystack).Any();
        }

        static IEnumerable<string> createNeedles(int n)
        {
            for (int i = 0; i < n; ++i)
                yield return i + "." + i + "." + i; // distinct needles
        }

        static string createHaystack(int n)
        {
            var sb = new StringBuilder();

            for (int i = 0; i < n; ++i)
                sb.AppendLine("Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text");

            return sb.ToString();
        }
    }
}

The results I get for a 64-bit RELEASE build (run outside the debugger) are as follows:

anyViaContains() took 00:00:09.8216836

anyViaAhoCorasick() took 00:00:00.4002765

For this test case, Aho-Corasick is around 25 times faster than Contains(). However, this is a somewhat artificial test case and your actual results may vary; benchmark with your real data to see whether it really helps.

Note that you can actually avoid loading the entire file into memory when using the Aho-Corasick implementation, because its Find() method accepts an IEnumerable<char>.

You can turn the contents of a file into an IEnumerable&lt;char&gt; as follows:

var chars = File.ReadLines(filename).SelectMany(line => line);

That approach drops all the newline characters, which is probably fine for your application. If you need to keep them, put them back like so:

IEnumerable<char> newline = Enumerable.Repeat('\n', 1);
var chars = File.ReadLines(filename).SelectMany(line => Enumerable.Concat(line, newline));

It would be worth comparing loading each file completely into memory with enumerating the chars in each file (as above) to see if there is any difference. For very large files, it may be important to avoid loading their entire contents into memory.
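A minimal sketch of such a comparison, using a throwaway temp file as synthetic data (a trivial character count stands in for the trie search, so only the two read strategies differ):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

public static class ReadBenchmark
{
    // Whole file loaded into memory at once.
    public static int CountViaReadAllText(string path, char c) =>
        File.ReadAllText(path).Count(ch => ch == c);

    // Streamed line by line; newline characters are dropped.
    public static int CountViaReadLines(string path, char c) =>
        File.ReadLines(path).SelectMany(line => line).Count(ch => ch == c);

    public static void Main()
    {
        // Throwaway sample file so the sketch is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, Enumerable.Repeat("Sample Text Sample Text", 100_000));

        var sw = Stopwatch.StartNew();
        int a = CountViaReadAllText(path, 'S');
        Console.WriteLine($"ReadAllText: {sw.Elapsed} ({a} hits)");

        sw.Restart();
        int b = CountViaReadLines(path, 'S');
        Console.WriteLine($"ReadLines:   {sw.Elapsed} ({b} hits)");

        File.Delete(path);
    }
}
```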

Matthew Watson answered Sep 30 '22 00:09


Does a file contain any of the strings?

private static bool ContainsLine(string file, List<string> wordsToFind) {
  return File
    .ReadLines(file)
    .Any(line => wordsToFind.Any(word => line.Contains(word))); 
}

Is there any file that contains any of the strings?

bool result = allFiles
  .AsParallel() // worth trying: many files to process
  .Any(file => ContainsLine(file, stringList));

Edit: .AsParallel() is often worth trying for problems like this (many files to test); however, in case it doesn't bring any gain, just comment it out:

bool result = allFiles
  //.AsParallel() // comment out in case of no gain
  .Any(file => ContainsLine(file, stringList));
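If in doubt, measure rather than guess. A minimal sketch of such a measurement, using synthetic in-memory texts in place of real files (ContainsAny here is a stand-in for ContainsLine above, operating on a string instead of a file path):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class ParallelCheck
{
    // Stand-in for ContainsLine: does the text contain any of the words?
    public static bool ContainsAny(string text, List<string> words) =>
        words.Any(text.Contains);

    public static void Main()
    {
        // Synthetic stand-ins for file contents and search strings.
        var texts = Enumerable.Range(0, 50)
            .Select(_ => string.Concat(Enumerable.Repeat("sample text ", 5_000)))
            .ToList();
        var words = Enumerable.Range(0, 20).Select(i => "needle" + i).ToList();

        var sw = Stopwatch.StartNew();
        bool seq = texts.Any(t => ContainsAny(t, words));
        Console.WriteLine($"sequential: {sw.Elapsed} result={seq}");

        sw.Restart();
        bool par = texts.AsParallel().Any(t => ContainsAny(t, words));
        Console.WriteLine($"parallel:   {sw.Elapsed} result={par}");
    }
}
```

Note that .Any() can short-circuit as soon as one match is found, so the parallel version pays off most in the worst case, when no file matches and every file must be scanned.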
Dmitry Bychenko answered Sep 30 '22 01:09