
Is there a fast way to parse through a large file with regex?

Problem: I have a very large file that I need to parse line by line, extracting 3 values from each line. Everything works, but parsing the whole file takes a long time, typically between 1 and 2 minutes. Is it possible to do this within seconds?

Example file size: 148,208 KB

I am using a regex to parse every line.

Here is my C# code:

private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}

Here is my regex:

public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg,
                                RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        // Finally, we get the Group value and display it.

        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}

Here is an example line to parse; a file usually contains around 200,000 lines:

10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789

Rayshawn asked Dec 10 '12 22:12


2 Answers

Memory-mapped files (MMF) and the Task Parallel Library (TPL) can help here.

  1. Create a persisted MMF with multiple random access views. Each view corresponds to a particular part of the file.
  2. Define a parsing method that takes a parameter such as IEnumerable<string>, which abstracts a set of unparsed lines.
  3. Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
  4. Each worker task adds its parsed data to a shared queue of type BlockingCollection.
  5. Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data already parsed by the worker tasks.

See the Pipelines pattern on MSDN.

Note that this solution requires .NET Framework 4 or later.
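
A rough sketch of steps 2 to 5, not a full implementation: worker tasks parse batches of lines and push results into a BlockingCollection, and a single consumer task drains it. To keep it short, batches of lines taken from any IEnumerable<string> stand in for the memory-mapped views of step 1, so splitting the file into views at line boundaries is not shown. The names PipelineSketch, Parse, SumRates and batchSize are made up for illustration, and summing the third captured value is only an assumption about what the final processing does.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Same pattern as in the question, built once and shared by all workers.
    // Matching with a Regex instance is safe from multiple threads.
    private static readonly Regex LineRegex = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase);

    // Step 2: parse one batch of unparsed lines.
    // Lines that do not match contribute 0, as in the question's code.
    public static List<int> Parse(IEnumerable<string> lines)
    {
        var rates = new List<int>();
        foreach (string line in lines)
        {
            Match m = LineRegex.Match(line);
            rates.Add(m.Success ? int.Parse(m.Groups[3].Value) : 0);
        }
        return rates;
    }

    // Steps 3 to 5: one task per batch, one consumer task.
    public static long SumRates(IEnumerable<string> allLines, int batchSize = 10000)
    {
        using (var parsed = new BlockingCollection<List<int>>())
        {
            // Step 5: the consumer processes batches as soon as workers finish them.
            Task<long> consumer = Task.Factory.StartNew(() =>
            {
                long total = 0;
                foreach (List<int> batch in parsed.GetConsumingEnumerable())
                    foreach (int rate in batch)
                        total += rate;
                return total;
            });

            // Steps 3 and 4: each worker parses its batch and adds the result to the queue.
            var workers = new List<Task>();
            var current = new List<string>(batchSize);
            foreach (string line in allLines)
            {
                current.Add(line);
                if (current.Count == batchSize)
                {
                    List<string> batch = current;
                    workers.Add(Task.Factory.StartNew(() => parsed.Add(Parse(batch))));
                    current = new List<string>(batchSize);
                }
            }
            if (current.Count > 0)
            {
                List<string> last = current;
                workers.Add(Task.Factory.StartNew(() => parsed.Add(Parse(last))));
            }

            Task.WaitAll(workers.ToArray());
            parsed.CompleteAdding(); // lets GetConsumingEnumerable() finish
            return consumer.Result;
        }
    }
}
```

It could be called as PipelineSketch.SumRates(File.ReadLines(inputFile)). Batches complete in arbitrary order, so this only suits processing where order does not matter, such as a sum.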

sll answered Nov 12 '22 04:11


Right now, you recreate your Regex each time you call GetRateLine, which occurs every time you read a line.

If you create a Regex instance once in advance, and then use the non-static Match method, you will save on regex compilation time, which could potentially give you a speed gain.

That being said, it will likely not take you from minutes to seconds...
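
A minimal sketch of that change, assuming the pattern from the question: the Regex is constructed once in a static field and the instance Match method is used for every line. The class name RateParser and the method name GetRate are placeholders, and RegexOptions.Compiled is an optional addition that is not part of the original code; it trades a slower start-up for faster matching and is worth measuring rather than assuming.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RateParser
{
    // Constructed once when the class is first used, then reused for every line.
    // RegexOptions.Compiled is an assumption added here; remove it to compare timings.
    private static readonly Regex RateRegex = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the third captured value of the line, or 0 when the line does not match.
    public static int GetRate(string justALine)
    {
        Match match = RateRegex.Match(justALine);
        return match.Success ? Convert.ToInt32(match.Groups[3].Value) : 0;
    }
}
```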

Reed Copsey answered Nov 12 '22 04:11