
Raku Grammar: Use named regex without consuming matching string

Tags: grammar, raku

I have a probably easy-to-answer Raku grammar question. I want to parse a log file and get the entries back one log entry at a time. A log entry can be a single line or a multi-line string.

My draft code looks like this:

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO ' | 'ERROR' }
    token logentry { .*? <.finish> }
    token finish { <.loglevel> || $ }
}

That works only for the first line: in the second line, the loglevel has already been consumed by the first entry's match, even though I used '.' inside the <> in the regex, which as far as I know means non-capturing.

Here is a log example:

INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Current process-name: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Session data:
    PID: 1234
    TID: 1234
    Session: 1
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Clean up.

What would be the right approach to get back the log entries even for multi line ones? Thanks!

user13195651 asked May 22 '20 12:05


1 Answer

The .*? works but is inefficient.
It has to do a lot of backtracking.
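As for the '.' in <.loglevel>: it only suppresses the capture, the token still consumes the text it matches. To check for something without consuming it you need a zero-width lookahead such as <?before ...>. Here is a minimal sketch of that idea which keeps your token names. Anchoring finish with ^^ is my own assumption, added so that a level name inside a message does not end an entry early, and the sample log is made up for illustration.

```raku
grammar Grammar::Entries {
    token TOP        { <logentries>+ }
    token logentries { <loglevel> <logentry> }
    token loglevel   { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    # .*? grows one character at a time until the lookahead succeeds.
    token logentry   { .*? <?before <.finish> > }
    # Only called inside <?before ...>, so whatever it matches is not consumed.
    token finish     { ^^ <.loglevel> || $ }
}

# Hypothetical sample data, shortened from the log in the question.
my $log = q:to/END/;
INFO    first entry
INFO    second entry
    continued
ERROR    third entry
END

my $match = Grammar::Entries.parse($log);
say $match<logentries>.elems;
print ~$match<logentries>[1]<logentry>;
```

This still relies on .*? and therefore on the backtracking described above.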

To improve it you could use \N* which matches everything except a newline.

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { \N* \n }
}

That only matches a single line, so you would have to add the multi-line matching back in.

    token logentry {
      <logline>* %% \n
    }
    token logline { <!before \w> \N* }

This would work, but it still isn't great.


I would structure the grammar more like the thing you are trying to parse.

grammar Grammar::Entries {
    token TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}

In your sample the continuation lines always start with four spaces, so we can require that prefix for anything that counts as a logline. The text that follows the log level on the first line of an entry also starts with four spaces, so the same token deals with it as well.

I really don't like that you have a token with a plural name that only matches one thing.
Basically I would rename logentries to logentry. Of course that means that the existing logentry needs a new name as well.

grammar Grammar::Entries {
    token TOP { <logentry>+ }

    token logentry { <loglevel> <logdata> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logdata { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}

I also don't like the redundant log prefixed to every token name.

grammar Grammar::Entries {
    token TOP { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data { <line>* }
    token line { '    ' <(\N+)> \n? }
}

So what this says is that a Grammar::Entries consists of at least one entry.
An entry starts with a level and ends with some data.
data consists of any number of lines.
A line starts with four spaces, followed by at least one non-newline character, and may end with a newline.
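To see that structure in the resulting match tree, here is a small usage sketch. The grammar is the one above; the variable names and the shortened sample log are only for illustration, and the sample assumes the fields are separated by four spaces as in the question.

```raku
grammar Grammar::Entries {
    token TOP   { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data  { <line>* }
    token line  { '    ' <(\N+)> \n? }
}

# Shortened sample; the heredoc keeps the four leading spaces.
my $log = q:to/END/;
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Session data:
    PID: 1234
    TID: 1234
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Clean up.
END

my $match = Grammar::Entries.parse($log);

# One iteration per log entry, however many lines it spans.
for $match<entry>.list -> $entry {
    say ~$entry<level>, ': ', $entry<data><line>.elems, ' line(s)';
    say '  ', ~$_ for $entry<data><line>.list;
}
```

Because of the <( and )> markers, each line match contains only the text after the four spaces, without the trailing newline.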


The point I'm trying to make is to structure the grammar the same way that the data is structured.

You could even go and add the structure for pulling out the information so that you don't have to do that as a second step.
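A rough sketch of what that could look like follows. The header, timestamp, pid, tid and message tokens are names I made up for this example, and the field layout is inferred only from the three sample entries in the question, so treat it as a starting point rather than a complete log format.

```raku
grammar Grammar::Entries {
    token TOP       { <entry>+ }

    token entry     { <header> <line>* }
    # Assumed layout: LEVEL TIMESTAMP PID[n] TID[n]: message
    token header    {
        <level> \h+ <timestamp> \h+
        'PID[' <pid> ']' \h+ 'TID[' <tid> ']:' \h+
        <message> \n?
    }
    token level     { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token timestamp { \d**4 '-' \d\d '-' \d\d 'T' \d\d ':' \d\d ':' \d\d 'Z' }
    token pid       { \d+ }
    token tid       { \d+ }
    token message   { \N* }
    # Continuation lines: four spaces, then the text we want to keep.
    token line      { '    ' <(\N+)> \n? }
}

my $log = q:to/END/;
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Session data:
    PID: 1234
    TID: 1234
    Session: 1
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Clean up.
END

my $match = Grammar::Entries.parse($log);

for $match<entry>.list -> $entry {
    my $h = $entry<header>;
    say ~$h<level>, ' ', ~$h<timestamp>, ' pid=', ~$h<pid>, ' tid=', ~$h<tid>, ' ', ~$h<message>;
    say '  detail: ', ~$_ for $entry<line>.list;
}
```

With this shape each field is available directly from the match, for example $entry<header><pid>, instead of being pulled out of the raw line afterwards.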

Brad Gilbert answered Sep 28 '22 02:09