I have a probably easy-to-answer Raku grammar question. I want to parse a log file and get back the entries one log entry at a time. A log entry can be a single line or a multi-line string.
My draft code looks like this:
grammar Grammar::Entries {
    rule TOP { <logentries>+ }
    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO ' | 'ERROR' }
    token logentry { .*? <.finish> }
    token finish { <.loglevel> || $ }
}
That works only for the first entry: the loglevel at the start of the second line is consumed by the first entry's match, even though I used '.' inside the <...> call, which as far as I know means non-capturing.
Here is an example log:
INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Current process-name: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Session data:
    PID: 1234
    TID: 1234
    Session: 1
INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Clean up.
What would be the right approach to get back the log entries even for multi line ones? Thanks!
The .*? works, but it is inefficient because it has to do a lot of backtracking.
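As an aside on why the draft stops after the first entry: the dot in <.finish> only keeps the match out of the captures, the text is still consumed. A zero-width lookahead checks for the next level without eating it. Here is a minimal sketch of that idea; the name Grammar::Lazy and the shortened sample string are made up for illustration.

```raku
grammar Grammar::Lazy {
    token TOP        { <logentries>+ }
    token logentries { <loglevel> <logentry> }
    token loglevel   { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    # Declared as a regex so the lazy .*? can grow one character at a time.
    regex logentry   { .*? <.finish> }
    # <?before ...> only peeks at the next level; nothing is consumed here.
    token finish     { <?before <.loglevel>> || $ }
}

# Made-up sample: three entries, the second one spans two lines.
my $log = "INFO    first entry\nINFO    second entry:\n    PID: 1234\nWARN    third entry\n";

my $match = Grammar::Lazy.parse($log);
say $match<logentries>.elems;             # number of entries found
say $match<logentries>[1]<logentry>.Str;  # text of the multi-line entry
```

This still ends an entry early if a level word such as INFO shows up inside the message text, so treat it as a fix for the draft rather than a robust approach.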
To improve on .*? you could use \N*, which matches everything except a newline.
grammar Grammar::Entries {
    rule TOP { <logentries>+ }
    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { \N* \n }
}
Then you would have to add the newline matching back in.
token logentry {
    <logline>* %% \n
}
token logline { <!before \w> \N* }
This would work, but it still isn't great.
I would structure the grammar more like the thing you are trying to parse.
grammar Grammar::Entries {
    token TOP { <logentries>+ }
    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}
Since I noticed that the log lines always start with 4 spaces, we can use that to make sure that only lines starting with them are counted as a logline. This also deals with the remaining data on the line with the log level.
I really don't like that you have a token with a plural name that only matches one thing.
Basically I would rename logentries to logentry. Of course, that means the existing logentry needs a new name as well.
grammar Grammar::Entries {
    token TOP { <logentry>+ }
    token logentry { <loglevel> <logdata> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logdata { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}
I also don't like the redundant log prepended to every token name.
grammar Grammar::Entries {
    token TOP { <entry>+ }
    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data { <line>* }
    token line { '    ' <(\N+)> \n? }
}
So what this says is that a Grammar::Entries consists of at least one entry.
An entry starts with a level, and ends with some data.
data consists of any number of lines.
A line starts with four spaces, followed by at least one non-newline character, and may end with a newline.
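To see that structure in action, here is a small usage sketch. It assumes the log really has four spaces after the level and in front of every continuation line; the sample string is a shortened stand-in for the log in the question.

```raku
grammar Grammar::Entries {
    token TOP   { <entry>+ }
    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data  { <line>* }
    # <( ... )> keeps only the text after the four spaces in the match.
    token line  { '    ' <(\N+)> \n? }
}

# Shortened sample; built line by line so the leading spaces are explicit.
my $log = (
    'INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Session data:',
    '    PID: 1234',
    '    TID: 1234',
    'INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Clean up.',
).join("\n") ~ "\n";

my $match = Grammar::Entries.parse($log) or die 'log did not parse';

# Walk the entries and print each one with its lines.
for $match<entry>.list -> $entry {
    say ~$entry<level>, ' with ', $entry<data><line>.elems, ' line(s)';
    say "  $_" for $entry<data><line>.list;
}
```

Each entry comes back as its own match, and a multi-line entry simply has more than one line under data.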
The point I'm trying to make is to structure the grammar the same way that the data is structured.
You could even go and add the structure for pulling out the information so that you don't have to do that as a second step.
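One way to do that is an actions class that builds a plain data structure while parsing. This is only a sketch: the class name EntryActions and the level/lines hash keys are my own invention, and the sample string is again a shortened stand-in for the real log.

```raku
grammar Grammar::Entries {
    token TOP   { <entry>+ }
    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data  { <line>* }
    token line  { '    ' <(\N+)> \n? }
}

# Hypothetical actions class: every entry becomes a hash with its level
# and the list of its lines as plain strings.
class EntryActions {
    method TOP($/)   { make $<entry>.map(*.made).list }
    method entry($/) {
        make %(
            level => ~$<level>,
            lines => $<data><line>.map(*.Str).list,
        )
    }
}

my $log = (
    'INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Session data:',
    '    PID: 1234',
    '    TID: 1234',
    'INFO    2020-01-22T11:07:38Z PID[8528] TID[6736]: Clean up.',
).join("\n") ~ "\n";

my $parsed = Grammar::Entries.parse($log, actions => EntryActions)
    or die 'log did not parse';
my @entries = $parsed.made.list;

# Each element is now an ordinary hash, no Match objects involved.
say @entries[0]<level>;
say @entries[0]<lines>.join(' | ');
```

From there the first line of each entry could be split further (timestamp, PID, TID) by adding tokens for those parts in the same style.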