Java Regular Expression

Tags: java, regex

{
Main Block
     {
     Nested Block
     }
}
{
Main Block 
     {
     Nested Block
     }
     {
     Nested Block
     }
}

I want to extract the contents of each Main Block, including its Nested Blocks, using a Java regex. Is that possible?

Thanks in Advance

Novice asked Aug 18 '10 21:08


2 Answers

A regular expression probably isn't the best tool for the job (since it appears that you can have arbitrarily-nested braces). I think you might be better off writing a parser based on some grammar (that you'll have to define).

Here is an EBNF to get you started; it's incomplete because I don't know what things can be inside your block (other than more blocks):

blocks        ::= { block }
block         ::= "{", block-content, "}"
block-content ::= blocks | things-other-than-blocks

For some resources on parsing, take a look at this answer.
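As a rough illustration of the non-regex route, here is a minimal sketch that extracts top-level blocks by counting brace depth. It is not a full grammar-based parser. It assumes braces are never escaped and never appear inside the "other things" in a block. The class name `BlockScanner` and the method `topLevelBlocks` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class BlockScanner {

    /** Returns every top-level {...} block, with its nested blocks included. */
    static List<String> topLevelBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        int depth = 0;   // how many '{' are currently open
        int start = -1;  // index of the '{' that opened the current top-level block

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    throw new IllegalArgumentException("Unmatched '}' at index " + i);
                }
                depth--;
                if (depth == 0) {
                    // The top-level block just closed; keep it, braces included.
                    blocks.add(text.substring(start, i + 1));
                }
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unclosed '{' at index " + start);
        }
        return blocks;
    }

    public static void main(String[] args) {
        String text = "{ Main { Nested { Deeper } } }\n{ Main { Nested } { Nested } }";
        for (String block : topLevelBlocks(text)) {
            System.out.println(">>> " + block + " <<<");
        }
    }
}
```

Because it only tracks a counter, this handles any nesting depth, but it would need extending (for example into a recursive-descent parser for the grammar above) if the structure of the nested blocks matters.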

Vivin Paliath answered Oct 10 '22 03:10


If there can be at most 1 level of nesting, and the brace characters cannot be escaped, then the regex pattern for this is in fact quite simple.

Essentially the structure we have, in some abstract notation, is:

{…(?:{…}…)*…}

Here's a visual breakdown:

  ___top___
 /   nest  \
/    / \    \
{…(?:{…}…)*…}
| \______/| |
|         | |
open      | close
          |
     zero or more

This is not quite regex, of course, because:

  • In "real" regex, we must escape the { and }, since they're metacharacters
  • In "real" regex, we need to replace … with the actual pattern for content
    • [^{}]*+ would be a fine pattern. The […] is a character class. [^…] is a negated character class. The * is zero-or-more repetition. The + following the repetition specifier is the possessive quantifier.

So, a meta-regexing technique is used to programmatically transform this abstract pattern (which is readable) into a valid regex pattern (which can be ugly at times like this). Here's an example (also see on ideone.com):

    import java.util.*;
    import java.util.regex.*;
    //...

    Pattern block = Pattern.compile(
        "{…(?:{…}…)*…}"
            .replaceAll("[{}]", "\\\\$0")
            .replace("…", "[^{}]*+")
    );
    System.out.println(block.pattern());
    // \{[^{}]*+(?:\{[^{}]*+\}[^{}]*+)*[^{}]*+\}

    String text
        = "{ main1 { sub1a } { sub1b } { sub1c } }\n"
        + "{ main2\n"
        + "   { sub2a }\n"
        + "       { sub2c }\n"
        + "}"
        + "   { last one, promise }    ";

    Matcher m = block.matcher(text);
    while (m.find()) {
        System.out.printf(">>> %s <<<%n", m.group());
    }
    // >>> { main1 { sub1a } { sub1b } { sub1c } } <<<
    // >>> { main2
    //    { sub2a }
    //        { sub2c }
    // } <<<
    // >>> { last one, promise } <<<        

As you can see, the actual regex pattern is therefore:

\{[^{}]*+(?:\{[^{}]*+\}[^{}]*+)*[^{}]*+\}

Which, as a Java string literal, is:

"\\{[^{}]*+(?:\\{[^{}]*+\\}[^{}]*+)*[^{}]*+\\}"

Variations

If the nesting level can be deeper but is still bounded, then regex can still be used. You can also allow the { and } to be "escaped" (i.e. used in the content part but not as block delimiters).

The final regex pattern will be quite complicated, but depending on how comfortable you are with meta-regexing (which requires you to be comfortable with regex itself), the code can be quite readable and manageable.
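To make these variations concrete, here is a hedged sketch of how the pattern could be generated for a fixed maximum nesting depth, with an optional content pattern that tolerates backslash-escaped characters. The names `NestedBlockPattern`, `blockRegex`, `PLAIN` and `ESCAPABLE` are invented for this example, and backslash as the escape character is an assumption. The generated pattern is a slightly trimmed form of the one above (it drops the redundant trailing content part and makes the group possessive).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class NestedBlockPattern {

    // Content without braces, as in the pattern above.
    static final String PLAIN = "[^{}]*+";

    // Assumed escape convention: a backslash followed by any one character
    // counts as content, so \{ and \} do not delimit blocks.
    static final String ESCAPABLE = "(?:[^{}\\\\]|\\\\.)*+";

    /** Builds the regex for a block that may contain up to maxNesting levels of nested blocks. */
    static String blockRegex(int maxNesting, String content) {
        String block = "\\{" + content + "\\}"; // a block with no nested blocks
        for (int level = 0; level < maxNesting; level++) {
            // Wrap the previous level: { content (inner-block content)* }
            block = "\\{" + content + "(?:" + block + content + ")*+\\}";
        }
        return block;
    }

    /** Collects every non-overlapping match of regex in text. */
    static List<String> findAll(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String regex = blockRegex(2, PLAIN);
        for (String block : findAll(regex, "{ a { b { c } } } { d }")) {
            System.out.println(">>> " + block + " <<<");
        }
    }
}
```

Note that the pattern grows with every extra level, and input nested more deeply than `maxNesting` is not rejected: `find()` simply matches the inner blocks that do fit.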

If the nesting level can be arbitrarily deep, then some flavors can still handle it (e.g. .NET with balancing groups, or Perl with recursive patterns), but Java regex supports neither and is not powerful enough.

polygenelubricants answered Oct 10 '22 03:10