
Regex grouping and optional matches

Firstly: I'm not strong with regex, so that's on the table. I am building a regex that uses named groups and optional components. The issue is that I need to match a certain number in two different places and give both matches the same group name. This does not appear to work.

Now the specific details. I am analyzing a garbage collection log from a JVM. The two lines in question are a full GC and a regular GC.

I broke these up to make them readable.

Full line:

229980.058: [Full GC 229980.058: 
            [CMS: 2796543K->2796543K(2796544K), **13.3050667** secs]
            2983863K->2872464K(4067264K), 
            [CMS Perm : 325367K->325242K(1048576K)], 13.3054416 secs] 
            [Times: user=13.27 sys=0.03, real=13.31 secs] 

Regular line:

2.752: [GC 2.752: 
       [ParNew: 1143680K->4938K(1270720K), **0.0243534** secs] 
       1143686K->4945K(4067264K), 0.0245283 secs] 
       [Times: user=0.05 sys=0.02, real=0.03 secs] 

As you can see, the Full GC has a CMS/tenured generation section as its first field area. The second one does not have this, as it's just the regular collection.

In order for these to be captured correctly, I've made both the "CMS:" and "ParNew:" sections optional. However, I want to pull the time out of each under one group name (the values I put ** around).

I'm using this regex:

\d+\.\d+: \[(Full\s)?GC\s\d+\.\d+: \[(CMS:\s(?<JVM_TenuredGenHeapUsedBeforeGC>\d+)+K->(?<JVM_TenuredHeapUsedAfterGC>\d+)K\(\d+K\),\s(?<JVM_GCTimeTaken>\d+\.\d+)\ssecs)? (ParNew:\s(?\d+)+K->(?<JVM_NewGenHeapUsedAfterGC>\d+)K\((?<JVM_NewGenHeapSize>\d+)K\),\s(?<JVM_GCTimeTaken>\d+\.\d+)\ssecs)?\] .. [edited for brevity]

In short: is it possible to use the same group name on different optional matches? They will never be on the same line, so I don't know why I can't pull this off.

Testing this with regexr also seems to fail. Thanks!

jgauthier asked Nov 18 '13 18:11



2 Answers

The issue I have, is that I need to match a certain number in two different areas, and give them the same group name.

I'd say that's the problem. I haven't tried this, but I saw the change list that introduced named groups, and a named group is just a name for a numbered group. So one name can't refer to two groups, and this can't work.

Give them different names and use something like Guava's Objects.firstNonNull (MoreObjects.firstNonNull in newer Guava versions):

Objects.firstNonNull(m.group("foo"), m.group("bar"))

if you're sure that at least one of them is non-null (otherwise you get an NPE). Or write your own null-accepting one-liner.
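
A null-accepting helper like the one mentioned above could look roughly like this. It is only a sketch using the JDK alone: the class name FirstNonNullSketch, the method name firstOrNull, and the group names foo and bar are placeholders, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstNonNullSketch {
    // Hypothetical helper: returns the first non-null argument,
    // or null when both are null (so it never throws an NPE).
    public static String firstOrNull(String first, String second) {
        return first != null ? first : second;
    }

    public static void main(String[] args) {
        // Two differently named groups, since Java rejects a repeated group name.
        Pattern p = Pattern.compile("(?<foo>a)|(?<bar>b)");
        Matcher m = p.matcher("b");
        if (m.matches()) {
            // A group that did not take part in the match yields null.
            System.out.println(firstOrNull(m.group("foo"), m.group("bar"))); // prints "b"
        }
    }
}
```

Unlike the Guava call, this returns null when neither group matched, so the caller has to handle that case.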

maaartinus answered Oct 09 '22 09:10


A little experimentation shows that Java does not allow you to define the same capturing group name twice within a regex. Compiling the pattern in the following code throws the exception shown after it:

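import java.util.regex.Pattern;

public class NamedCapturingGroupMain {
    public static void main(String[] args) {
        // The name "mygroup" is defined twice, so compile() throws PatternSyntaxException.
        Pattern p = Pattern.compile("(?<mygroup>a)|(?<mygroup>b)");
    }
}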

Exception:

Exception in thread "main" java.util.regex.PatternSyntaxException: Named capturing group <mygroup> is already defined near index 24

The easiest thing to do here would probably be to define two different capturing group names, and use the second one if the first one is null. For example, if you used "JVM_GCTimeTakenFull" and "JVM_GCTimeTakenPartial", you could then do something like:

String gcTimeTaken = matcher.group("JVM_GCTimeTakenFull");
if (gcTimeTaken == null) {
    gcTimeTaken = matcher.group("JVM_GCTimeTakenPartial");
}
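
Put together against the two sample log lines from the question (joined back onto one line each), the approach might look like the sketch below. The pattern is a simplified assumption rather than the asker's full regex, and the names GcTimeSketch, extractGcTime, gcTimeFull, and gcTimePartial are hypothetical. The group names avoid underscores because java.util.regex accepts only ASCII letters and digits in a group name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcTimeSketch {
    // Simplified pattern: one alternation branch per collector,
    // each with its own distinctly named group for the time value.
    private static final Pattern GC_LINE = Pattern.compile(
        "\\[(?:Full )?GC \\d+\\.\\d+: "
        + "\\[(?:CMS: \\d+K->\\d+K\\(\\d+K\\), (?<gcTimeFull>\\d+\\.\\d+) secs"
        + "|ParNew: \\d+K->\\d+K\\(\\d+K\\), (?<gcTimePartial>\\d+\\.\\d+) secs)\\]");

    // Returns the GC time from whichever branch matched, or null if the line does not match.
    public static String extractGcTime(String line) {
        Matcher m = GC_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String time = m.group("gcTimeFull");
        if (time == null) {
            time = m.group("gcTimePartial");
        }
        return time;
    }

    public static void main(String[] args) {
        String full = "229980.058: [Full GC 229980.058: "
            + "[CMS: 2796543K->2796543K(2796544K), 13.3050667 secs] "
            + "2983863K->2872464K(4067264K), "
            + "[CMS Perm : 325367K->325242K(1048576K)], 13.3054416 secs] "
            + "[Times: user=13.27 sys=0.03, real=13.31 secs]";
        String regular = "2.752: [GC 2.752: "
            + "[ParNew: 1143680K->4938K(1270720K), 0.0243534 secs] "
            + "1143686K->4945K(4067264K), 0.0245283 secs] "
            + "[Times: user=0.05 sys=0.02, real=0.03 secs]";
        System.out.println(extractGcTime(full));    // prints "13.3050667"
        System.out.println(extractGcTime(regular)); // prints "0.0243534"
    }
}
```

Using a single alternation instead of two optional groups also means a line with neither section simply fails to match, rather than matching with both groups null.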

Mike Clark answered Oct 09 '22 09:10