Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to match Java String

Tags:

java

regex

scala

In Scala's parser combinators (JavaTokensParser in particular) there is a definition stringLiteral that matches a Java-like string.

def stringLiteral: Parser[String] =
              ("\""+"""([^"\p{Cntrl}\\]|\\[\\'"bfnrt]|\\u[a-fA-F0-9]{4})*"""+"\"").r

Unfortunately, this regex does not work for long strings. Does anyone know of a re-usable implementation that does, or a modification to the regex that is more space efficient?

like image 564
schmmd Avatar asked Oct 07 '13 19:10

schmmd


1 Answers

Interesting problem!!

Just played around with this, and came up with the following:

val r = ("\"" + "(?:[^\"\\p{Cntrl}\\\\]*|(?:\\\\(?:[\\\\'\"bfnrt]|u[a-fA-F0-9]{4}))*)*" + "\"").r

Note: the above regex was fixed from the first version… both the leading '\' and the trailing characters need to be repeated, not just the trailing characters as I originally had it!

Edit: Found a more efficient regular expression. Using the following, it can parse a string of up to 950 \\ns pairs, as opposed to the original which could only parse 556, at least in my default configuration.

val r = ("\"" + "(?:[^\"\\p{Cntrl}\\\\]*|\\\\[\\\\'\"bfnrt]|\\\\u[a-fA-F0-9]{4})*" + "\"").r

Edit 2: Based on the comment from @schmmd, I have an even better regular expression. This one can parse the 2500 \ns torture case. The secret is to use the greedy possessive modifier, this basically turns off the need to backtrack, and hence also turns off the recursion.

val r = (""""([^"\p{Cntrl}\\]*+(?:\\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+"""").r

The essence of the solution is to try and chew as much as you can each time that you match something.

scala> val r = (""""([^"\p{Cntrl}\\]*+(?:\\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+"""").r
r: scala.util.matching.Regex = "([^"\p{Cntrl}\\]*+(?:\\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+"

scala> r.pattern.matcher("\"" + "\\ns" * 2500 + "\"").lookingAt
res4: Boolean = true

scala> r.pattern.matcher("\"" + "s" * 2500 + "\"").lookingAt
res5: Boolean = true

Update: A pull request was submitted to the scala folks. And it was accepted.

like image 177
Paul Wagland Avatar answered Oct 06 '22 07:10

Paul Wagland