
Conditional match without false force a match?

Tags: c#, regex

I'm using the following regex in c# to match some input cases:

^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$

The pattern is used with the IgnorePatternWhitespace option.

My input looks as follows:

hello
#world
[xxx]

All of this can be tested here: DEMO

My problem is that this regex does not match the last line. Why? I'm trying to check for an entry character; if it is present, I require an identifier via \w+. The rest of the input should be captured in the last group.

This is a simplified regex and simplified input.

The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
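As a minimal sketch of the first workaround, the snippet below gives the conditional an explicit, empty "no" branch and runs it against the three sample inputs, one string at a time. The variable names and the top-level program layout (C# 9 or later) are assumptions made for illustration; they are not part of the original setup.

```csharp
using System;
using System.Text.RegularExpressions;

// Workaround variant: the conditional has an explicit, empty "no" branch.
const string pattern = @"
    ^
    (?<entry>[#])?
    (?(entry)(?<id>\w+)|)
    (?<value>.*)
    $";

var regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

// Each sample is matched on its own, so no line endings are involved.
string[] inputs = { "hello", "#world", "[xxx]" };

foreach (string input in inputs)
{
    Match m = regex.Match(input);
    Console.WriteLine(
        $"{input,-8} success={m.Success} id='{m.Groups["id"].Value}' value='{m.Groups["value"].Value}'");
}
```

Swapping in one of the other two variants only requires changing the third line of the pattern.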

I'm trying to understand why the conditional group doesn't match as written in the original regex.

I'm comfortable with regex and know that this one can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:

^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$

That's the reason why I'm trying to use a conditional match.

UPDATE 10/12/2018

I experimented a bit more. I found the following regex, which should match every input, even an empty one, but it doesn't:

(?(a)a).*

DEMO
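One detail worth keeping in mind when reading this pattern: the .NET documentation states that when the name inside (?(name)...) does not correspond to a capturing group in the pattern, the construct is treated as an expression test. There is no group named a here, so the condition behaves like a lookahead for the letter a. The sketch below prints what the reported pattern returns next to a variant with an explicit empty "no" branch; that variant, the sample inputs, and the variable names are assumptions added for comparison. No expected output is claimed for the reported pattern, since its behaviour is exactly what is in question.

```csharp
using System;
using System.Text.RegularExpressions;

string reported = @"(?(a)a).*";      // pattern from the update above
string withEmptyNo = @"(?(a)a|).*";  // assumed variant with an explicit empty "no" branch

string[] samples = { "", "abc", "xyz" };

foreach (string sample in samples)
{
    foreach (string p in new[] { reported, withEmptyNo })
    {
        Match m = Regex.Match(sample, p);
        Console.WriteLine($"pattern={p,-12} input='{sample}' success={m.Success} value='{m.Value}'");
    }
}
```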

I believe this is a bug in the .NET regex engine and have reported it to Microsoft: see here for more information.

Sebastian Schumann asked Nov 17 '22 00:11


1 Answer

The error is not in the regex parser but in the use of the . wildcard. The . specifier consumes every character except, wait for it, the linefeed character \n. (See "the any character" in Character Classes in Regular Expressions.)

If you want your regex to work, you need to consume all characters including the linefeed, which can be done by specifying the Singleline option. To paraphrase the documentation:

Singleline tells the parser to make . match every character, including \n.
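A small sketch of that difference, using a made-up two-line string (the sample text and variable names are assumptions, and the program uses C# 9 top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

string text = "first\nsecond";

// Default behaviour: '.' stops at the line feed.
string defaultValue = Regex.Match(text, ".*").Value;

// Singleline: '.' also matches '\n', so the whole string is consumed.
string singlelineValue = Regex.Match(text, ".*", RegexOptions.Singleline).Value;

Console.WriteLine(defaultValue == "first");            // True
Console.WriteLine(singlelineValue == "first\nsecond"); // True
```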


Why does it still fail outside Singleline mode, when the other lines are consumed? Because the final match leaves the current position at the \n, and the only construct available at that point is .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the match at this point.


Let me demonstrate what is happening with a tool I created to illustrate the issue. In the tool, the upper-left pane shows the example text as we see it. Below that is what the parser sees, with the \r and \n characters represented by ↵ and ¶ respectively; the same pane encloses whatever is currently matched in yellow boxes. The middle box is the actual pattern, and the box on the right shows the match results in detail by listing out the returned structures, again with the whitespace made visible.

[Screenshot: what is matched before Singleline]

Notice that the second match (index 1) has world in the id group and \r in the value group.

I surmise your token processor isn't getting what you want in the proper groups, and because the \r that value successfully matched is invisible, it gets overlooked.
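The stray \r is easy to reproduce in isolation. The sketch below is a reduced illustration, not the pattern from the question: it assumes Windows line endings and the Multiline option, and keeps only the value group. With Multiline, $ matches just before \n, while . is free to match \r. The second pattern shows one possible way to keep the \r out of the group; both patterns and all names here are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

string windowsText = "hello\r\n#world\r\n[xxx]";

// Greedy .* runs up to the '\n', so it swallows the '\r' in front of it.
MatchCollection raw = Regex.Matches(
    windowsText, @"^(?<value>.*)$", RegexOptions.Multiline);

// Lazy .*? plus an optional '\r' outside the group leaves the value clean.
MatchCollection trimmed = Regex.Matches(
    windowsText, @"^(?<value>.*?)\r?$", RegexOptions.Multiline);

// Print the captures with '\r' made visible.
foreach (Match m in raw)
    Console.WriteLine("raw:     '" + m.Groups["value"].Value.Replace("\r", "\\r") + "'");
foreach (Match m in trimmed)
    Console.WriteLine("trimmed: '" + m.Groups["value"].Value.Replace("\r", "\\r") + "'");
```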

Let us turn on Singleline and see what happens.

[Screenshot: match results with Singleline enabled]

Now everything is consumed, but there is a different problem. :-)

ΩmegaMan answered Dec 28 '22 02:12