
How do I translate this Perl regular expression into Java?

Tags: java, regex, perl

How would you translate this Perl regex into Java?

/pattern/i

While this compiles, it does not match "PattErn" for me; it fails:

Pattern p = Pattern.compile("/pattern/i");
Matcher m = p.matcher("PattErn");

System.out.println(m.matches()); // prints "false"
James Raitsev asked Dec 06 '22 19:12


1 Answer

How would you translate this Perl regex into Java?

/pattern/i

You can't.

There are a lot of reasons for this. Here are a few:

  • Java doesn't support as expressive a regex language as Perl does. It lacks grapheme support (like \X) and full property support (like \p{Sentence_Break=SContinue}), is missing Unicode named characters, doesn't have a (?|...|...|) branch reset operator, doesn’t have named capture groups or a logical \x{...} escape before Java 7, has no recursive regexes, and so on. I could write a book on what Java is missing here. Get used to going back to a very primitive and awkward-to-use regex engine compared with what you’re used to.

  • Another, even worse problem is that you have lookalike faux amis like \w and \b and \s, and even \p{alpha} and \p{lower}, which behave differently in Java compared with Perl; in some cases the Java versions are completely unusable and buggy. That’s because Perl follows UTS#18 but before Java 7, Java did not. You must add the UNICODE_CHARACTER_CLASSES flag, available from Java 7, to get these to stop being broken. If you can’t use Java 7, give up now, because Java had many, many other Unicode bugs before Java 7 and it just isn’t worth the pain of dealing with them.

  • Java handles linebreaks via ^ and $ and ., but Perl expects Unicode linebreaks to be \R. You should look at UNIX_LINES to understand what is going on there.

  • Java does not by default apply any Unicode casefolding whatsoever. Make sure to add the UNICODE_CASE flag to your compilation. Otherwise you won’t get things like the various Greek sigmas all matching one another.

  • Finally, it is different because at best Java only does simple casefolding, while Perl always does full casefolding. That means that you won’t get \xDF to match "SS" case insensitively in Java, and similar related issues.
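A few of the points above can be checked directly. The following is only a minimal sketch, assuming Java 7 or later; the class name UnicodeFlagsDemo and the helper fullMatch are made up for illustration. It shows \w with and without UNICODE_CHARACTER_CLASSES, Greek sigma with and without UNICODE_CASE, and the lack of full casefolding for \xDF.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and helper names are hypothetical.
class UnicodeFlagsDemo {
    // True if the whole input matches the regex under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute       = "\u00e9"; // e with acute accent
        String smallSigma   = "\u03c3"; // Greek small letter sigma
        String capitalSigma = "\u03a3"; // Greek capital letter sigma
        String sharpS       = "\u00df"; // German sharp s

        int ci      = Pattern.CASE_INSENSITIVE;
        int ciUni   = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        int uniWord = Pattern.UNICODE_CHARACTER_CLASSES;

        // By default the word-character class only covers ASCII.
        System.out.println(fullMatch("\\w", 0, eAcute));             // false
        System.out.println(fullMatch("\\w", uniWord, eAcute));       // true

        // CASE_INSENSITIVE alone only folds ASCII letters.
        System.out.println(fullMatch(smallSigma, ci, capitalSigma));    // false
        System.out.println(fullMatch(smallSigma, ciUni, capitalSigma)); // true

        // Simple casefolding only: sharp s never matches the two letters "SS".
        System.out.println(fullMatch(sharpS, ciUni, "SS"));          // false
    }
}
```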

In summary, the closest you can get is to compile with the flags

 CASE_INSENSITIVE | UNICODE_CASE | UNICODE_CHARACTER_CLASSES

which is equivalent to an embedded "(?iuU)" in the pattern string.
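As a rough sketch of that equivalence (assuming Java 7 or later; EmbeddedFlagsDemo and its two helpers are hypothetical names, not part of any library), the same pattern can be compiled both ways and should behave the same:

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and method names are hypothetical.
class EmbeddedFlagsDemo {
    static final int FLAGS = Pattern.CASE_INSENSITIVE
                           | Pattern.UNICODE_CASE
                           | Pattern.UNICODE_CHARACTER_CLASSES;

    // Flags passed as the second argument to Pattern.compile.
    static boolean findWithFlags(String regex, String input) {
        return Pattern.compile(regex, FLAGS).matcher(input).find();
    }

    // The same three flags embedded at the start of the pattern string.
    static boolean findWithEmbedded(String regex, String input) {
        return Pattern.compile("(?iuU)" + regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "I have your PaTTerN right here";
        System.out.println(findWithFlags("pattern", line));    // true
        System.out.println(findWithEmbedded("pattern", line)); // true
    }
}
```

The embedded form is handy when the pattern string comes from a config file or is passed to an API such as String.matches that takes no flags argument.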

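And remember that match in Java doesn’t mean match, perversely enough: matches() succeeds only if the entire input matches the pattern, so for Perl-style matching anywhere in the string you want find().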
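To make the difference concrete, here is a small sketch comparing the three Matcher methods on the same pattern. The class name MatchVsFindDemo and its helpers are invented for the example; matches(), lookingAt() and find() are the standard java.util.regex.Matcher methods.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and helper names are hypothetical.
class MatchVsFindDemo {
    static final Pattern P = Pattern.compile("pattern", Pattern.CASE_INSENSITIVE);

    // The whole input must match, as if the regex were anchored at both ends.
    static boolean wholeInput(String s) { return P.matcher(s).matches(); }

    // The match must start at the beginning of the input but may stop early.
    static boolean prefix(String s) { return P.matcher(s).lookingAt(); }

    // The match may occur anywhere, which is what Perl's =~ does.
    static boolean anywhere(String s) { return P.matcher(s).find(); }

    public static void main(String[] args) {
        String line = "I have your PaTTerN right here";
        System.out.println(wholeInput(line));             // false
        System.out.println(prefix(line));                 // false
        System.out.println(anywhere(line));               // true
        System.out.println(wholeInput("PattErn"));        // true
        System.out.println(prefix("PaTTerN right here")); // true
    }
}
```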


EDIT

And here’s the rest of the story...

While this compiles, it does not match "PattErn" for me; it fails:

   Pattern p = Pattern.compile("/pattern/i");
   Matcher m = p.matcher("PattErn");
   System.out.println(m.matches()); // prints "false"

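You shouldn’t have slashes around the pattern. Java has no regex delimiters, so the slashes and the trailing i are compiled as literal characters to be matched; the /i modifier has to become a compile flag instead.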
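A quick sketch of what the slashed version actually does in Java (SlashLiteralDemo and its field names are made up for the example): the compiled pattern matches the literal text /pattern/i and nothing else.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and field names are hypothetical.
class SlashLiteralDemo {
    // The slashes and the trailing i are ordinary literal characters here.
    static final Pattern WITH_SLASHES = Pattern.compile("/pattern/i");

    // Delimiters dropped, and the i modifier expressed as a flag.
    static final Pattern TRANSLATED =
        Pattern.compile("pattern", Pattern.CASE_INSENSITIVE);

    static boolean fullMatch(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(WITH_SLASHES, "PattErn"));    // false
        System.out.println(fullMatch(WITH_SLASHES, "/pattern/i")); // true
        System.out.println(fullMatch(TRANSLATED, "PattErn"));      // true
    }
}
```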

The best you can do is to translate

$line = "I have your PaTTerN right here";
if ($line =~ /pattern/i) {
    print "matched.\n";
}

this way

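import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerlToJava {
    public static void main(String[] args) {
        String line     = "I have your PaTTerN right here";
        String pattern  = "pattern";
        // The flag constants live on Pattern, so they need the Pattern. prefix
        // (or a static import of java.util.regex.Pattern.*).
        Pattern regcomp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE
                                                 | Pattern.UNICODE_CASE
                        // comment next line out for legacy Java \b\w\s breakage
                                                 | Pattern.UNICODE_CHARACTER_CLASSES
                                         );
        Matcher regexec = regcomp.matcher(line);
        // find() looks for the pattern anywhere in the line, like Perl's =~
        if (regexec.find()) {
            System.out.println("matched");
        }
    }
}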

There, see how much easier that isn’t? :)

tchrist answered Dec 10 '22 11:12