I've developed a regex that matches pstops page specifications. (Regex whitespace not significant.)
^(?:(?<modulo>\d+):)?
(?<pages>
(?<pagespec>
(?<pageno>-?\d+)
(?<rotation>[RUL]?)?
(?:@(?<scale>\d*(?:\.\d+)))?
(?:\(
(?<xoff>\d*\.?\d+)(?<xunit>in|cm|w|h)?
,
(?<yoff>\d*\.?\d+)(?<yunit>in|cm|w|h)?
\))?
\+?)+,?
)+$
Sample string:
"4:[email protected](21cm,0)[email protected](21cm,14.85cm),1L(21cm,0)[email protected](21cm,14.85cm)"
As you can see, there are nested named subgroups. A pagespec need not specify rotation, for example. I would like to be able to do something to the effect of this:
If match.Groups("pages").Captures(0).Groups("pagespec").Captures(1).Groups("rotation").Value > ""
but of course a Capture has no Groups property. Is there any way to access subgroups in the hierarchy in this way?
EDIT: Here is a more minimal example (whitespace significant this time):
(?<paragraph>(?:(?<sentence>The (?<child>boy|girl) is hungry\.|The (?<parent>mother|father) is angry\.)\s*)+)
Matched against this string:
The boy is hungry. The mother is angry. The girl is hungry.
produces one match. Within that match:
Groups("paragraph") has one capture matching the entire string.
Groups("sentence") has three captures.
Groups("child") has two captures, boy and girl.
Groups("parent") has one capture, mother.
But there is nothing that tells me that the single capture for parent lies within the second capture for sentence, unless I start looking at Index and Length for each capture.
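The Index and Length comparison mentioned above can be sketched as follows. This is only an illustration in Python, not .NET code: Python's re module keeps just the last capture of a repeated group, so the capture lists are rebuilt here with finditer, and the helper names spans and containing are made up for this example. Each tuple stands in for the Index, Length and Value of a .NET Capture.

```python
import re

TEXT = "The boy is hungry. The mother is angry. The girl is hungry."

# One alternative per sentence type, mirroring the "sentence" group above.
SENTENCE = re.compile(
    r"The (?P<child>boy|girl) is hungry\.|The (?P<parent>mother|father) is angry\."
)

def spans(pattern, text, group=0):
    """Collect (index, length, value) for every participating match of `group`."""
    out = []
    for m in pattern.finditer(text):
        if m.group(group) is not None:
            start, end = m.span(group)
            out.append((start, end - start, m.group(group)))
    return out

def containing(outer, inner):
    """Position in `outer` of the capture that encloses `inner`, or None."""
    i_start, i_len, _ = inner
    for pos, (o_start, o_len, _) in enumerate(outer):
        if o_start <= i_start and i_start + i_len <= o_start + o_len:
            return pos
    return None

sentences = spans(SENTENCE, TEXT)           # three captures
parents = spans(SENTENCE, TEXT, "parent")   # one capture: mother
children = spans(SENTENCE, TEXT, "child")   # two captures: boy, girl

# For each inner capture, the position of the sentence that encloses it.
parent_owner = [containing(sentences, p) for p in parents]
child_owner = [containing(sentences, c) for c in children]
```

In .NET the same containment test would presumably run over Capture.Index and Capture.Length directly; it works, but it is exactly the bookkeeping I was hoping to avoid.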
EDIT: Here's the final answer:
^(?:(?<modulo>\d+):)?
(?<pages>
(?<pagespec>
(?<pageno>-?\d+)
(?<rotation>[RUL]?)
(?:@(?<scale>\d*(?:\.\d+)))?
(?:\(
(?<xoff>\d*\.?\d+)(?<xunit>in|cm|w|h)?
,
(?<yoff>\d*\.?\d+)(?<yunit>in|cm|w|h)?
\))?
(?<pageno>)(?<rotation>)(?<scale>)(?<xoff>)(?<xunit>)(?<yoff>)(?<yunit>)
\+?)+,?
(?<pagespec>)
)+
This pushes an empty capture onto the pagespec stack after each pages capture, so the pagespec captures can be correlated with pages; and an empty capture onto each of the other named stacks after each pagespec. Gee, parsing is hard ...
I don't think this is possible. As far as I know, different groups have no relation to each other with regard to how they are nested in the pattern. Moreover, such a hierarchy wouldn't even make sense, because group names can be reused in .NET:
(?<group>
(?<sub>.)
)+
(?<sub>.)
I guess it would somehow be possible to represent this as a hierarchical tree as well, but that would defeat the purpose of the stacks .NET maintains for captures. Maybe I should clarify that: .NET does not simply list all the captures of a group - it pushes them onto a stack, from which they can be popped again with (?<-sub>), for instance. Now how would you treat that if an instance of a group later on pops something from the stack that was matched way earlier? I think it would become highly unintuitive, if not impossible, to solve for the general case.
What you actually want is to group your pagespecs captures by the single "instance" of pages they belong to. You can do this thanks to the very thing that rules out the solution you'd like to have: you can reuse group names:
^(?:(?<modulo>\d+):)?
(?<pages>
(?<pagespecs>
# here goes your actual pagespec pattern
[+]?)+
(?<pagespecs>)
,?
)+$
Now at the end of every pages instance you push an empty string onto the pagespecs stack. Since a normal "instance" of pagespecs will always contain at least one character, you know that any empty captures must have come from that separate use of pagespecs. So you can now split Groups("pagespecs").Captures at the empty-string elements, and then just associate the resulting runs sequentially with the elements in Groups("pages").Captures.
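The splitting step could look roughly like this. It is a sketch in Python over plain lists of strings rather than real .NET Capture objects; the function name group_by_sentinel and the sample values are invented for illustration and only stand in for what Groups("pagespecs").Captures and Groups("pages").Captures might contain.

```python
def group_by_sentinel(captures):
    """Split a flat capture list into runs, closing a run at each empty string."""
    groups, current = [], []
    for value in captures:
        if value == "":
            # An empty capture marks the end of one pages instance.
            groups.append(current)
            current = []
        else:
            current.append(value)
    if current:
        # Captures left over without a closing sentinel form a final run.
        groups.append(current)
    return groups

# Hypothetical capture values, in the order the stacks would list them.
pagespecs_captures = ["0L(21cm,0)+", "1L(21cm,14.85cm)", "", "2R", ""]
pages_captures = ["0L(21cm,0)+1L(21cm,14.85cm),", "2R"]

# Pair each pages capture with the pagespecs captures that belong to it.
by_page = list(zip(pages_captures, group_by_sentinel(pagespecs_captures)))
```

The pairing relies only on order: the n-th run between sentinels belongs to the n-th pages capture, so no Index or Length arithmetic is needed.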