How can I specify an optional capture group in this RegEx?

Tags: .net, regex

How can I fix this RegEx to optionally capture a file extension?

I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)


My RegEx (.NET Flavor) is as follows:

.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.*                   # Ignore some garbage in the front
(header_             # Match the start of the file name,
    \d{10,11}_)      #     including the ID (10 - 11 digits)
.*                   # Ignore the type code in the middle
(_.*_\d{8})          # Match some random characters, then an 8-digit date
.*                   # Ignore anything between this and the file extension
(\.\w{3,4})          # Match the file extension, 3 or 4 characters long
.*                   # Ignore the rest of the string


I expect this to match strings like:

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


Where the capture groups return something like:

$1  =  header_0000000602_
$2  =  _mc2e1nrobr1a3s55niyrrqvy_20081212
$3  =  .doc


Where $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.
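
The expected captures above can be checked with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for the .NET engine (an assumption for illustration; the constructs in this pattern behave the same way in both), and the `captures` helper is a hypothetical name, not part of the original question.

```python
import re

# Original pattern from the question, with a mandatory extension group.
PATTERN = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*")

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


def captures(text):
    """Return the three capture groups as a tuple, or None if there is no match."""
    m = PATTERN.search(text)
    return m.groups() if m else None


for s in (str1, str2, str3):
    print(captures(s))
```

With the extension group mandatory, str1 and str2 yield three groups, while str3 yields no match at all because it contains no "." for the third group to start on.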

If I add "?" to the end of the third capture group, "(\.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead, "(\.\w{3,4})+", the RegEx no longer matches str3 at all, which is to be expected.

I feel that using "?" at the end of the third capture group is the appropriate thing to do, but it doesn't work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.


Doesn't Work As Expected:

.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*
Asked by EndangeredMassa, Jan 28 '09 17:01


1 Answer

One possibility is that the second to last .* is being greedy. You might try changing it to:

.*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*
                             ^ Added that
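
To see what the optional group actually captures, here is a small check of the greedy and lazy variants. It is a sketch in Python's `re`, used as a stand-in for .NET (an assumption; greedy, lazy, and optional quantifiers behave the same way in both engines, though an unmatched group shows up as `None` in Python and as an unsuccessful `Group` in .NET). The names `GREEDY`, `LAZY`, and `extension` are made up for illustration.

```python
import re

# The asker's pattern with an optional extension group (greedy .* before it).
GREEDY = re.compile(r".*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*")
# The same pattern with the preceding .* made lazy.
LAZY = re.compile(r".*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*")

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"


def extension(pattern, text):
    """Return group 3; None means no match or the group did not participate."""
    m = pattern.search(text)
    return m.group(3) if m else None


print(extension(GREEDY, str1))
print(extension(LAZY, str1))
```

Both calls print `None` even though both patterns match str1. In the greedy version, the `.*` before the group consumes the rest of the string, and the optional group is satisfied by matching nothing at the end. In the lazy version, `.*?` starts by matching nothing, the group is attempted right after the date (at the "["), fails there, and is skipped; the final `.*` then takes the remainder.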

That wasn't correct. This one will match the input you supplied, but it assumes that the first . it encounters is the start of a file extension:

.*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*
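
A hedged check of this pattern, again sketched in Python's `re` as a stand-in for .NET (an assumption; the constructs used here behave the same way in both engines). The `extension` helper and the `tricky` input are hypothetical and exist only to illustrate the caveat about the first ".".

```python
import re

# [^\.]* can only skip characters that are not dots, so the optional
# group is attempted exactly at the first "." after the date.
FIXED = re.compile(r".*(header_\d*_).*(_.*_.{8})[^\.]*(\.\w{3,4})?.*")

str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"
# Hypothetical input: the first "." after the date is not the extension.
tricky = "header_0000000602_t_abc_20081212 v1.2 final.doc"


def extension(text):
    """Return group 3; None means no match or the group did not participate."""
    m = FIXED.search(text)
    return m.group(3) if m else None


for s in (str1, str2, str3, tricky):
    print(extension(s))
```

For the three strings from the question this gives ".doc", ".txt", and `None` (str3 still matches, with groups 1 and 2 filled in). The hypothetical `tricky` string also gives `None`: the group is tried at ".2", which is too short to be a 3 or 4 character extension, and the real ".doc" is never considered.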

Edit: Removed the escaping I had in the second regex.

Answered by Sean Bright, Sep 28 '22 15:09