
Read in a file using a regular expression?

This is tangentially related to an earlier question of mine.

Essentially, the solution in that question worked great, but now I need to adapt it to work in a much larger analysis application. Simply using StreamReader.ReadToEnd() is not acceptable, since some of the files I will be reading in are very, very large. If there's been a mistake and someone forgot to clean up, they can theoretically be gigabytes big. Obviously I can't just read to the end of that.

Unfortunately, reading line by line with ReadLine is also not acceptable on its own, because some of the rows of data I am reading in contain stack traces, which obviously use \r\n in their formatting. Ideally, I would like to tell the program to read forward until it hits a match for a regex, which it then returns. Is there any functionality to do this in .NET? If not, can I get some suggestions for how I'd go about writing it?

Edit: To make it a bit easier to follow my question, here's a paste of some of the important parts of the adapted code:

foreach (var fileString in logpath.Select(log => new StreamReader(log)).Select(fileStream => fileStream.ReadToEnd()))
{
    const string junkPattern = @"\[(?<junk>[0-9]*)\] \((?<userid>.{0,32})\)";
    const string severityPattern = @"INFO|ERROR|FATAL";
    const string datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
    var records = Regex.Split(fileString, datePattern, RegexOptions.Multiline);
    foreach (var record in records.Where(x => string.IsNullOrEmpty(x) == false))
    ......

The problem lies in the foreach: .Select(fileStream => fileStream.ReadToEnd()) is going to blow up memory badly, I just know it.

tmesser asked Nov 13 '22 18:11


1 Answer

First of all, you should move your const definitions up to the class declaration. The compiler treats them the same way wherever they are declared, but declaring them at class level makes the code easier to read.

As @Blam mentioned, you should use StringBuilder and StreamReader.ReadLine together, something like this:

foreach (var filePath in logpath)
{
    var sbRecord = new StringBuilder();
    using (var reader = new StreamReader(filePath))
    {
        string line;
        // ReadLine returns null at end of file, so an empty file is skipped safely
        while ((line = reader.ReadLine()) != null)
        {
            // a line that starts with a timestamp begins a new record
            if (Regex.IsMatch(line, datePattern) && sbRecord.Length > 0)
            {
                // your method for a log record
                HandleRecord(sbRecord.ToString());
                sbRecord.Clear();
            }
            // every line, including stack-trace lines, belongs to the current record
            sbRecord.AppendLine(line);
        }
    }
    // the last record has no following timestamp line, so flush it here
    if (sbRecord.Length > 0)
    {
        HandleRecord(sbRecord.ToString());
    }
}

If I have misunderstood something about your problem, please clarify it in a comment.
Also, you can use the ThreadPool to schedule the handling of each record, which may speed up your application.
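
The same idea can be packaged as a lazy iterator, so that only one record is held in memory at a time and the caller decides what to do with each one. This is only a sketch: the class name LogRecordReader and the method ReadRecords are hypothetical, and it assumes every record begins on a line matched by the pattern you pass in (datePattern from the question, for example).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the .NET class library.
public static class LogRecordReader
{
    // Lazily yields one multi-line record at a time from the reader.
    // A new record starts at every line matched by recordStart.
    public static IEnumerable<string> ReadRecords(TextReader reader, Regex recordStart)
    {
        var current = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A matching line closes the record collected so far.
            if (recordStart.IsMatch(line) && current.Length > 0)
            {
                yield return current.ToString();
                current.Clear();
            }
            current.AppendLine(line);
        }
        // Emit the final record, which has no following start line.
        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }
}
```

You would open each file with a StreamReader inside a using block, build a Regex from datePattern once, and loop over LogRecordReader.ReadRecords(reader, regex) with foreach, calling HandleRecord on each item. Because the iterator is lazy, memory use depends on the size of the largest single record rather than the size of the file.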

VMAtm answered Nov 15 '22 09:11