
Does Java support if-then-else regexp constructs (Perl constructs)?

Tags: java, regex, perl

I receive a PatternSyntaxException when I try to compile the following regex:

"bd".matches("(a)?b(?(1)c|d)")

This regex should match bd and abc. It should not match bc.

Any ideas? Thanks.

OK, I need to write a regex that matches the following 4 strings:

*date date* date date1*date2

It should not match:

*date* date1*date2* *date1*date2 date** ...

But this should be done with a single match, not several.

Please do not post an answer like:

(date*date)|(*date)|(date*)|(date)
r.r asked Nov 09 '11 22:11


1 Answer

Imagine, if you can, a language that lacked an else statement, in which you still wanted to emulate one. Instead of writing

if (condition) { yes part }
else           { no part  }

You would have to write

if (condition)   { yes part }
if (!condition)  { no part  }

Well, that’s what you have to do here, but in the pattern. Because Java patterns have no conditionals, you repeat the condition, negated, in the ELSE branch, which is really just an alternation (OR) branch.
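Applied to the pattern from the question, the idea can be sketched in Java roughly as follows. This is only an illustration, not code from the original answer: the class name ConditionalEmulation and its helper methods are made up for the example, and the rewritten pattern is one possible translation. The (?!a) lookahead is redundant here, since the next character has to be b anyway, but it is kept to show the negated condition explicitly.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, for illustration only.
final class ConditionalEmulation {
    // Perl-style conditional from the question; java.util.regex rejects it.
    static final String PERL_STYLE = "(a)?b(?(1)c|d)";

    // Rewrite in which each branch states its own condition:
    //   branch 1: "a" is present -> "b" must be followed by "c"
    //   branch 2: "a" is absent  -> "b" must be followed by "d"
    static final Pattern EMULATED = Pattern.compile("(?:(a)bc|(?!a)bd)");

    // Returns true if the regex compiles, false on PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Whole-string match against the emulated pattern.
    static boolean matches(String input) {
        return EMULATED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("Perl-style pattern compiles: " + compiles(PERL_STYLE));
        for (String s : new String[] {"bd", "abc", "bc", "abd"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

With whole-string matching, bd and abc are accepted while bc and abd are rejected, which is the behaviour the question asks for.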

So for example, instead of writing this in a language like Perl, which supports conditionals in patterns:

# definition of \b using a conditional in the pattern like Perl
#
(?(?<=      \w)     # if there is a word character to the left
      (?!   \w)     #    then there must be no word character to the right
  |   (?=   \w)     #    else there must be a  word character to the right
)

In Java you must write:

# definition of \b using a duplicated condition like Java
#
(?:   (?<=  \w)     # if there is a word character to the left
      (?!   \w)     #    then there must be no word character to the right
  |                 # ...otherwise...
      (?<!  \w)     # if there is no word character to the left
      (?=   \w)     #    then there must be a word character to the right
)

You may recognize that as being the definition of \b. Here, similarly, is the definition of \B, first using conditionals:

# definition of \B using a conditional in the pattern like Perl
#
(?(?<=      \w)     # if there is a word character to the left
      (?=   \w)     #    then there must be a  word character to the right
  |   (?!   \w)     #    else there must be no word character to the right
)

And now by repeating the (now negated) condition in the OR branch:

# definition of \B using a duplicated condition like Java
#
(?:   (?<=  \w)     # if there is a word character to the left
      (?=   \w)     #    then there must be a word character to the right
  |                 # ...otherwise...
      (?<!  \w)     # if there is no word character to the left
      (?!   \w)     #    then there must be no word character to the right
)

Notice that no matter how you roll them, the respective definitions of \b and \B alike rest solely on the definition of \w, never on \W, let alone on \s.
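As a quick check of the duplicated-condition forms, here is a minimal Java sketch that lists the positions at which each emulated assertion succeeds. The class name BoundaryEmulation and the positions helper are invented for this example. The patterns are written without the whitespace and comments shown above; note that the \B variant requires a word character on both sides or on neither.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class BoundaryEmulation {
    // Emulated \b: word character on exactly one side.
    static final Pattern WORD_BOUNDARY =
        Pattern.compile("(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))");

    // Emulated \B: word character on both sides, or on neither.
    static final Pattern NON_BOUNDARY =
        Pattern.compile("(?:(?<=\\w)(?=\\w)|(?<!\\w)(?!\\w))");

    // Collects the start offset of every (zero-width) match.
    static List<Integer> positions(Pattern pattern, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "ab cd";
        System.out.println("boundaries:     " + positions(WORD_BOUNDARY, input));
        System.out.println("non-boundaries: " + positions(NON_BOUNDARY, input));
    }
}
```

For the input "ab cd" the boundary positions are 0, 2, 3 and 5, and the non-boundary positions are 1 and 4. Both patterns mention only \w, with the negation carried by the lookarounds.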

Being able to use conditionals not only saves typing, it also reduces the chance of doing it wrong. There may also be occasions where you do not care to evaluate the condition twice.

Here I make use of that to define several regex subroutines that provide me with a Greeklish atom and boundaries for the same:

(?(DEFINE)
    (?<greeklish>            [\p{Greek}\p{Inherited}]   )
    (?<ungreeklish>          [^\p{Greek}\p{Inherited}]  )
    (?<greek_boundary>
        (?(?<=      (?&greeklish))
              (?!   (?&greeklish))
          |   (?=   (?&greeklish))
        )
    )
    (?<greek_nonboundary>
        (?(?<=      (?&greeklish))
              (?=   (?&greeklish))
          |   (?!   (?&greeklish))
        )
    )
)

Notice how the boundary and nonboundary definitions use only (?&greeklish), never (?&ungreeklish)? You don’t ever need the non-whatever just to do boundaries. You put the not into your lookarounds instead, just as \b and \B both do.

Although in Perl it’s probably easier (albeit less general) just to define a new, custom property, \p{IsGreeklish} (and hence its complement \P{IsGreeklish}):

 sub IsGreeklish {
     return <<'END';
 +utf8::IsGreek
 +utf8::IsInherited
 END
 }

You won’t be able to translate either of those into Java, though. That is not so much because Java lacks conditionals as because its pattern language doesn’t allow (DEFINE) blocks or regex subroutine calls like (?&greeklish); indeed, your patterns cannot even recurse in Java. Nor can you define custom properties like \p{IsGreeklish} in Java.
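A rough workaround, which is not part of the original answer, is to assemble the pattern from ordinary Java string constants, so that the Greeklish class is written once and spliced in wherever a subroutine call would have gone. The sketch below assumes Java 7 or later, where script properties such as \p{IsGreek} and \p{IsInherited} are available. The class name GreeklishBoundary and its members are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class GreeklishBoundary {
    // Stand-in for the (?<greeklish> ...) subroutine: a plain string constant.
    static final String GREEKLISH = "[\\p{IsGreek}\\p{IsInherited}]";

    // Boundary with the condition duplicated and negated, built by
    // concatenation because Java has no (?&name) subroutine calls.
    static final Pattern GREEK_BOUNDARY = Pattern.compile(
        "(?:(?<=" + GREEKLISH + ")(?!" + GREEKLISH + ")"
        + "|(?<!" + GREEKLISH + ")(?=" + GREEKLISH + "))");

    // Collects the start offset of every (zero-width) match.
    static List<Integer> positions(String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = GREEK_BOUNDARY.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // "x", alpha, beta, space, "y"
        System.out.println(positions("x\u03B1\u03B2 y"));
        // alpha followed by a combining acute accent (script Inherited)
        System.out.println(positions("\u03B1\u0301"));
    }
}
```

This keeps the single definition of the character class, though without the named, reusable subroutines that (DEFINE) provides in Perl.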

And of course conditionals in Perl regexes can be more than lookarounds: they can even be code blocks to execute — which is why you certainly don’t want to be forced to evaluate the same condition twice, lest it have side-effects. That doesn’t apply to Java, because it can’t do that. You can’t intermix pattern and code, which limits you more than you might think before you get in the habit of doing so.

There is a great deal you can do with the Perl regex engine that you can do in no other language, and this is just a sample. It’s no wonder that the greatly expanded Regexes chapter in the new 4th edition of Programming Perl, together with the completely rewritten Unicode chapter that now immediately follows it (having been promoted into part of the inner core), runs to something like 130 pages, about double the length of the old chapter on pattern matching from the 3rd edition.

What you’ve just seen above is part of the new 4th edition, which should be in print next month or so.

tchrist answered Sep 24 '22 02:09