With nested named groups in a regex, possible to navigate hierarchy?

Tags: .net, regex

I've developed a regex that matches pstops page specifications. (Whitespace in the regex is not significant.)

^(?:(?<modulo>\d+):)?
(?<pages>
  (?<pagespec>
    (?<pageno>-?\d+)
    (?<rotation>[RUL]?)?
    (?:@(?<scale>\d*(?:\.\d+)))?
    (?:\(
      (?<xoff>\d*\.?\d+)(?<xunit>in|cm|w|h)?
      ,
      (?<yoff>\d*\.?\d+)(?<yunit>in|cm|w|h)?
    \))?
  \+?)+,?
)+$

'Sample string:
'"4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm),1L(21cm,0)+-2L@.7(21cm,14.85cm)"

As you can see, there are nested named subgroups. A pagespec need not specify rotation, for example. I would like to be able to do something to the effect of this:

If match.Groups("pages").Captures(0).Groups("pagespec").Captures(1).Groups("rotation").Value > ""

but of course Captures has no Groups property. Is there any way to access subgroups in the hierarchy in this way?

EDIT: Here is a more minimal example (whitespace is significant this time):

(?<paragraph>(?:(?<sentence>The (?<child>boy|girl) is hungry\.|The (?<parent>mother|father) is angry\.)\s*)+)

Matched against this string:

The boy is hungry. The mother is angry. The girl is hungry.

produces one match. Within that match,

Groups("paragraph") has one capture matching the entire string.
Groups("sentence") has three captures.
Groups("child") has two captures, boy and girl.
Groups("parent") has one capture, mother.

But there is nothing that tells me that the single capture for parent lies within the second capture for sentence, unless I start looking at Index and Length for each capture.
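The Index/Length comparison mentioned above can be sketched as follows. This is only an illustration, not a built-in feature: CapturesWithin is a hypothetical helper name, and it assumes that an inner capture belongs to an outer capture whenever its span lies entirely inside the outer span.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CaptureNesting
    ' The minimal pattern from the question, split only for readability.
    Const Pattern As String =
        "(?<paragraph>(?:(?<sentence>The (?<child>boy|girl) is hungry\.|" &
        "The (?<parent>mother|father) is angry\.)\s*)+)"

    ' Hypothetical helper: returns the captures of innerGroup whose span
    ' lies entirely inside the span of the outer capture.
    Function CapturesWithin(outer As Capture, innerGroup As Group) As List(Of Capture)
        Dim result As New List(Of Capture)
        For Each c As Capture In innerGroup.Captures
            If c.Index >= outer.Index AndAlso
               c.Index + c.Length <= outer.Index + outer.Length Then
                result.Add(c)
            End If
        Next
        Return result
    End Function

    Sub Main()
        Dim input As String = "The boy is hungry. The mother is angry. The girl is hungry."
        Dim m As Match = Regex.Match(input, Pattern)

        ' Walk the sentences and list the child/parent captures inside each one.
        For Each sentence As Capture In m.Groups("sentence").Captures
            Console.WriteLine(sentence.Value)
            For Each c As Capture In CapturesWithin(sentence, m.Groups("child"))
                Console.WriteLine("  child: " & c.Value)
            Next
            For Each c As Capture In CapturesWithin(sentence, m.Groups("parent"))
                Console.WriteLine("  parent: " & c.Value)
            Next
        Next
    End Sub
End Module
```

With the sample string, this lists boy under the first sentence, mother under the second, and girl under the third. Note that a zero-length capture sitting exactly on a boundary would be attributed to both neighbouring outer captures, so this approach needs care when groups can match the empty string.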

EDIT: Here's the final answer:

^(?:(?<modulo>\d+):)?
(?<pages>
  (?<pagespec>
    (?<pageno>-?\d+)
    (?<rotation>[RUL]?)
    (?:@(?<scale>\d*(?:\.\d+)))?
    (?:\(
      (?<xoff>\d*\.?\d+)(?<xunit>in|cm|w|h)?
      ,
      (?<yoff>\d*\.?\d+)(?<yunit>in|cm|w|h)?
    \))?
    (?<pageno>)(?<rotation>)(?<scale>)(?<xoff>)(?<xunit>)(?<yoff>)(?<yunit>)
  \+?)+,?
 (?<pagespec>)
)+

This pushes an empty capture onto the pagespec stack after each pages iteration, so that the pagespec captures can be correlated with the pages captures, and an empty capture onto each of the other named stacks after each pagespec. Gee, parsing is hard ...

asked Aug 16 '13 18:08

Ross Presser

1 Answer

I don't think this is possible. As far as I know, different groups have no relation to each other with regard to how they are nested in the pattern. Moreover, such a hierarchy wouldn't even make sense, because group names can be reused in .NET:

(?<group>
  (?<sub>.)
)+
(?<sub>.)

I guess it would somehow be possible to represent this as a hierarchical tree as well, but that would defeat the purpose of the stacks .NET maintains for captures. To clarify: .NET does not simply list all the captures of a group; it pushes them onto a stack from which they can be popped again, for instance with (?<-sub>). How would you treat the case where an instance of a group later pops something from the stack that was matched much earlier? I think it would become highly unintuitive, if not impossible, to solve for the general case.
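To make the stack behaviour concrete, here is a small sketch. The pattern and the name RemainingSubCaptures are made up for illustration and are not part of the pstops regex: each a pushes a capture onto the sub stack, and each b pops one off again via a balancing group.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StackDemo
    ' Each "a" pushes a capture onto the "sub" stack;
    ' each "b" pops one capture off it again via (?<-sub>...).
    Const Pattern As String = "^(?<sub>a)+(?<-sub>b)+$"

    ' Hypothetical helper: number of captures left on the "sub" stack
    ' after a successful match, or -1 if the input does not match.
    Function RemainingSubCaptures(input As String) As Integer
        Dim m As Match = Regex.Match(input, Pattern)
        If Not m.Success Then Return -1
        Return m.Groups("sub").Captures.Count
    End Function

    Sub Main()
        Console.WriteLine(RemainingSubCaptures("aaab"))   ' three pushed, one popped: 2
        Console.WriteLine(RemainingSubCaptures("aaabbb")) ' three pushed, three popped: 0
    End Sub
End Module
```

After the match, the Captures collection only reflects what is still on the stack, which is why a fixed parent/child tree of captures would be hard to define in general.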

What you actually want is to group your pagespecs captures by the "instance" of pages they belong to. You can do this thanks to the very feature that rules out the solution you'd like to have: groups can be reused:

^(?:(?<modulo>\d+):)?
(?<pages>
  (?<pagespecs>
     # here goes your actual pagespec pattern
  [+]?)+
  (?<pagespecs>)
  ,?
)+$

Now, at the end of every pages iteration, an empty string is pushed onto the pagespecs stack. Since a normal "instance" of pagespecs will always contain at least one character, you know that any empty capture must have come from that separate use of pagespecs. So you can split Groups("pagespecs").Captures at the empty-string elements, and then just associate the resulting runs sequentially with the elements of Groups("pages").Captures.
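The splitting step could look roughly like this. It is a sketch under some assumptions: GroupPagespecs is a hypothetical helper name, the pagespec sub-pattern is a simplified stand-in for the real one (it does not name pageno, rotation, and so on), and the pattern is matched with RegexOptions.IgnorePatternWhitespace.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module PagespecGrouping
    ' Simplified stand-in for the real pagespec pattern; the empty
    ' (?<pagespecs>) acts as a separator at the end of each pages iteration.
    Const Pattern As String =
        "^(?:(?<modulo>\d+):)?" &
        "(?<pages>" &
        "  (?<pagespecs>" &
        "    -?\d+ [RUL]? (?:@\d*\.\d+)? (?:\([^)]*\))?" &
        "  [+]?)+" &
        "  (?<pagespecs>)" &
        "  ,?" &
        ")+$"

    ' Hypothetical helper: splits the pagespecs captures at the empty
    ' separator captures, giving one list of pagespecs per pages capture.
    Function GroupPagespecs(m As Match) As List(Of List(Of String))
        Dim result As New List(Of List(Of String))
        Dim current As New List(Of String)
        For Each c As Capture In m.Groups("pagespecs").Captures
            If c.Length = 0 Then
                result.Add(current)
                current = New List(Of String)
            Else
                current.Add(c.Value)
            End If
        Next
        Return result
    End Function

    Sub Main()
        Dim input As String =
            "4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm),1L(21cm,0)+-2L@.7(21cm,14.85cm)"
        Dim m As Match = Regex.Match(input, Pattern, RegexOptions.IgnorePatternWhitespace)
        Dim grouped As List(Of List(Of String)) = GroupPagespecs(m)

        ' The i-th run of pagespecs belongs to the i-th pages capture.
        For i As Integer = 0 To grouped.Count - 1
            Console.WriteLine("pages: " & m.Groups("pages").Captures(i).Value)
            For Each spec As String In grouped(i)
                Console.WriteLine("  pagespec: " & spec)
            Next
        Next
    End Sub
End Module
```

Because [+]? sits inside the pagespecs group in this pattern, a captured pagespec may carry a trailing +, so you may want to trim it before further processing.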

answered Sep 18 '22 15:09

Martin Ender
