Differences in RegEx syntax between Python and Java

Tags: java, python, regex

I have a working regex in Python that I am trying to convert to Java. There seems to be a subtle difference between the two implementations.

The regex is meant to match another regex (a JavaScript-style regex literal). The regex in question is:

/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)

One of the strings it has problems with is: /\s+/;

The regex is not supposed to match the ending ;. In Python the regex works correctly (and does not match the ending ;), but in Java it does include the ;.

The Question(s):

  1. What can I do to get this RegEx working in Java?
  2. Based on what I read here there should be no difference for this RegEx. Is there a list somewhere of the differences between the RegEx implementations in Python and Java?
Vineet asked May 08 '12 03:05


2 Answers

Java doesn't parse regular expressions the same way as Python in a small set of cases. In this particular case the nested [ characters were causing problems: inside a character class, Java treats an unescaped [ as the start of a nested class, whereas Python treats it as a literal. So in Python you don't need to escape a nested [, but in Java you do.

The original RegEx (for Python):

/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)

The fixed RegEx (for Java and Python):

/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)
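As a quick check on the Python side, here is a minimal sketch (assuming Python 3; the helper name matched_text and the sample strings are mine, not from the question) showing that both the original and the fixed pattern stop before the trailing ; when run through Python's re module. It says nothing about how Java parses the original pattern.

```python
import re

# Pattern as posted in the question (nested '[' and closing ']' left unescaped).
ORIGINAL = r'/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)'
# Pattern from this answer, with the nested '[' and the closing ']' escaped.
FIXED = r'/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)'


def matched_text(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# The string from the question, plus one with a character class and a flag.
samples = [r'/\s+/;', r'/[a-z/]+/g;']

for sample in samples:
    print(sample, '->', matched_text(ORIGINAL, sample), matched_text(FIXED, sample))
```

In Python the two patterns should behave identically, which is why the escaped version is safe to use in both languages.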
Vineet answered Sep 20 '22 12:09


The obvious difference between Java and Python is that in a Java string literal you need to escape every backslash, so a regex written in Java source contains many more escape characters.
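To illustrate the escaping overhead, here is a small sketch in Python (the variable names are mine). A Python raw string passes backslashes through untouched, while an ordinary string literal needs each one doubled, which is the same doubling a Java string literal requires. The pattern used is the escaped version of the one from the question.

```python
import re

# Raw string: backslashes reach the regex engine exactly as written.
raw = r'/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)'

# Ordinary string: every backslash is doubled, as it would be in Java source.
doubled = '/(\\\\.|[^\\[/\\\\\\n]|\\[(\\\\.|[^\\]\\\\\\n])*\\])+/([gim]+\\b|\\B)'

# Both literals describe the same pattern text.
same_pattern = (raw == doubled)

# The doubled form behaves the same once compiled.
match = re.search(doubled, '/\\s+/;')
print(same_pattern, match.group(0))
```

Forgetting to double a single backslash when porting a pattern is an easy way to end up with a different regex than intended, so it is worth comparing the two forms this way.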

Moreover, you are probably running into a mismatch between the matching methods, not a difference in the actual regex notation:

Given the Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex, input; // initialized to something
Matcher matcher = Pattern.compile( regex ).matcher( input );
  • Java's matcher.matches() (also Pattern.matches( regex, input )) matches the entire string. Python's closest equivalent is re.fullmatch( regex, input ), available since Python 3.4; on older versions the same result can be achieved by using re.match( regex, input ) with a regex that ends with $.
  • Java's matcher.find() and Python's re.search( regex, input ) match any part of the string.
  • Java's matcher.lookingAt() and Python's re.match( regex, input ) match the beginning of the string.
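The three matching modes above can be sketched on the Python side as follows. This is only an illustration: it assumes Python 3.4+ (for re.fullmatch, which did not exist when this answer was written), and the helper text_of and the sample inputs are mine. The Java method named in each comment is the counterpart listed above.

```python
import re

# Escaped pattern from the other answer.
PATTERN = re.compile(r'/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)')


def text_of(match):
    """Return the matched text, or None when there was no match."""
    return match.group(0) if match else None


line = r'x = /\s+/;'

# Like Java's matcher.find(): the match may start anywhere in the input.
found = text_of(PATTERN.search(line))
# Like Java's matcher.lookingAt(): the match must start at index 0.
at_start = text_of(PATTERN.match(line))
# Like Java's matcher.matches(): the match must span the whole input.
whole_with_semicolon = text_of(PATTERN.fullmatch(r'/\s+/;'))
whole_without = text_of(PATTERN.fullmatch(r'/\s+/'))

print(found, at_start, whole_with_semicolon, whole_without)
```

If the Java port uses matches() where the Python code used re.search or re.match, the results will differ even though the pattern is the same.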

For more details also read Java's documentation of Matcher and compare to the Python documentation.

Since you said that isn't the problem, I decided to run a test: http://ideone.com/6w61T. It looks like Java is doing exactly what you need it to (group 0, the entire match, doesn't contain the ;). Your problem is elsewhere.

trutheality answered Sep 24 '22 12:09