
How do I use regex in grep to match multiple lines and only get the last matched set?

I have a file with some statistics like this

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

and I would like to extract the last set of statistics (in the example above, the last 6 lines). As you can see, the number of lines in each section can change, but the first and the last line stay constant. I was thinking about using:

  • "TOTAL" as an anchor point to grab the first and the last line of the wanted block of text
  • (?s) mode to match all lines in between those two

I ended up with the regex (?m)^.*?TOTAL(?s).*?(?m)TOTAL.*?$. To use it on Linux, I ran the following command, which relies on grep's -P (Perl-compatible regex) option; I haven't had much luck with -E:

tac con.log | grep -Po "(?m)^.*?TOTAL(?s).*?(?m)TOTAL.*?\$" -m1 | tac

which resulted in this correct output

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

as expected. However, this was in my testing environment, which uses an old grep (version 2.5.3). When I tried it on my other machine, which runs Rocky Linux 9 with grep 3.6, I got no match at all. Since the regex also worked when I tested it at regex101.com, I believe this might be a nuance of newer grep versions. Do these newer versions of grep require anything special for a regex like this to work, or is there another way to get this result? (Ultimately, it will be used in a bash script.)

justabitjanky asked Sep 17 '25 12:09



1 Answer

With Perl, one way is

perl -0777 -wnE'$r = $1 while /(^[0-9\s:-]+TOTAL.+? TOTAL.+?$)/smxg; say $r' file

or

perl -0777 -wnE'say for /.*( ^[0-9\s:-]+ TOTAL.+? TOTAL.+?$ )/smxg' file

The first version captures and assigns every such record in turn, so the last one is what remains; the second matches the whole file up to the last record. Either way the file has to be read once in full, whereas the approach from the question makes three passes over it. If performance is an issue, the file can be processed backwards instead (the original answer links to an example and to a performance comparison).
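The backwards idea could be sketched in shell roughly as follows. This is only an illustration, not the linked solution: it assumes GNU tac is available and that every block starts with a line containing TOTAL MEMORY ALLOCATION CONSUMPTION: as in the sample, and the function name last_block_tac is made up for this example. Reading bottom-up, sed stops at the first header it meets, which is the header of the last block, so only the tail of the file is scanned.

```sh
# Hypothetical helper: print the last statistics block of the file given as $1.
# Assumes GNU tac and that each block starts with the header line used below.
# Steps: reverse the file, print lines up to and including the first header
# seen (sed's q prints the current line, then quits), and reverse again.
last_block_tac() {
    tac -- "$1" | sed '/TOTAL MEMORY ALLOCATION CONSUMPTION:/q' | tac
}

# Usage: last_block_tac con.log
```

Unlike the pipeline in the question, this needs no multi-line regex, so it should not depend on how a given grep version handles -P.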

Altogether I'd recommend a short script instead.
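As one possible shape for such a script, here is a minimal sketch that wraps awk in a shell function so it can be dropped into a bash script. The function name last_block and the choice of awk are assumptions for illustration; the header text is taken from the sample in the question. It makes a single forward pass and keeps only the lines collected since the most recent header.

```sh
# Hypothetical helper: print the last statistics block of the file given as $1.
# Assumes each block starts with a line containing the header text below.
last_block() {
    awk -v header='TOTAL MEMORY ALLOCATION CONSUMPTION:' '
        index($0, header) { n = 0 }   # new header: forget the previous block
        { block[++n] = $0 }           # collect the current line
        END { for (i = 1; i <= n; i++) print block[i] }
    ' "$1"
}

# Usage: last_block con.log
```

Because the header is matched with index() rather than a regex, nothing in it needs escaping. If the file happens to contain no header at all, this sketch prints the whole file, which a real script may want to treat as an error.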

Not sure why grep does what you show; I'd imagine that the above regex should work, even slightly simplified using grep's conventions.
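One possible explanation, offered as an assumption rather than a confirmed diagnosis: GNU grep matches line by line, so a -P pattern cannot span a newline unless -z (--null-data) makes grep treat the whole file as a single record. Under that assumption, a grep-only variant might look like the sketch below. It needs a GNU grep built with PCRE support, the function name last_block_grep is made up for this example, and the pattern is a reworked form of the one in the question rather than the original.

```sh
# Hypothetical helper: print the last statistics block of the file given as $1.
# Assumes GNU grep with -P (PCRE) support.
#   -z    treat the input as NUL-separated, so the whole file is one record
#   -o    print only the matched part (terminated by NUL because of -z)
#   (?s)  let . match newlines
# The negative lookahead rejects any header that is followed by another
# header later in the file, so only the last block can match.
last_block_grep() {
    grep -Pzo '(?s)[^\n]*TOTAL MEMORY ALLOCATION CONSUMPTION:(?!.*TOTAL MEMORY ALLOCATION CONSUMPTION:).*' "$1" |
        tr -d '\0'
}

# Usage: last_block_grep con.log
```

With -z the whole file is held in memory as one record, so for very large logs a line-oriented tool is probably the safer choice.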


In the question as originally posted by the OP there was a perl tag.

zdim answered Sep 19 '25 04:09
