Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode Regex in Scala REPL

I want to detect words of Unicode Letters (\p{L}).

Scala's REPL gives back false for the following statement, while in Java it's true (which is the right behaviour):

java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()

Both Java and Scala are running in JRE 1.7:

System.getProperty("java.version") gives back "1.7.0_60-ea"

What could be the reason for that?

like image 318
pvorb Avatar asked Feb 17 '14 19:02

pvorb


People also ask

Does regex work with Unicode?

This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.

What is regex in Scala?

Regular Expressions explain a common pattern utilized to match a series of input data so, it is helpful in Pattern Matching in numerous programming languages. In Scala Regular Expressions are generally termed as Scala Regex.

Does JavaScript regex support Unicode?

As mentioned in other answers, JavaScript regexes have no support for Unicode character classes.


2 Answers

Probably a non-compatible character encoding used within the interpreter. For example, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, this blog post (which is a bit old) :

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...] Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]


Wasn't the case here, but may also happen: alternatively, your "ä" could be a diaeresis (umlaut) sign followed by "a", e.g.:

scala> println("a\u0308")                                                                                             
ä                                                                                                                                                                                                                    
scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()                                         
res1: Boolean = false

This is sometimes a problem on some systems which create diacritics through Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.

like image 198
mikołak Avatar answered Sep 29 '22 10:09

mikołak


You can also "Enable the Unicode version of Predefined character classes and POSIX character classes" as described in java.util.regex.Pattern and UNICODE_CHARACTER_CLASS

This means you can use character classes such as '\w' to match Unicode characters like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above '(?U)' bit is an Embedded Flag Expressions that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.

This flag is supported starting from Java 7.

like image 26
marko Avatar answered Sep 29 '22 09:09

marko