I want to detect words of Unicode letters (\p{L}).
Scala's REPL gives back false for the following statement, while in Java it's true (which is the right behaviour):
java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
Both Java and Scala are running in JRE 1.7: System.getProperty("java.version") gives back "1.7.0_60-ea".
What could be the reason for that?
This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.
Probably the interpreter is using an incompatible character encoding. For example, here's my output:
scala> System.getProperty("file.encoding")
res0: String = UTF-8
scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true
So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, this blog post (which is a bit old):
The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:
$ JAVA_OPTS="-Dfile.encoding=utf8" scala
[...] Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
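To check whether the encoding is really the culprit, one option is to write the letter as a Unicode escape, which doesn't depend on how the REPL decodes its input, and to look at the code units the string actually contains. This is only a rough diagnostic sketch; the object and method names (EncodingCheck, codeUnits, isSingleLetter) are made up for illustration.

```scala
import java.util.regex.Pattern

object EncodingCheck {
  private val letter = Pattern.compile("\\p{L}")

  // Hex value of each UTF-16 code unit, to see what the REPL actually received.
  def codeUnits(s: String): Seq[String] = s.map(c => f"U+${c.toInt}%04X")

  // True only if the whole string is exactly one Unicode letter.
  def isSingleLetter(s: String): Boolean = letter.matcher(s).matches()
}

// "\u00e4" is ä written as an escape, so the file/terminal encoding plays no role.
println(EncodingCheck.codeUnits("\u00e4"))      // one code unit: U+00E4
println(EncodingCheck.isSingleLetter("\u00e4")) // true
```

If the escaped form matches but a typed "ä" does not, and codeUnits shows more than one code unit for the typed literal, the input was most likely decoded with the wrong charset.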
Wasn't the case here, but may also happen: alternatively, your "ä" could be an "a" followed by a combining diaeresis (umlaut) sign, e.g.:
scala> println("a\u0308")
ä
scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false
This is sometimes a problem on some systems which create diacritics through Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.
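If you do run into the combining-character case, two possible workarounds are to normalize the input to NFC first (java.text.Normalizer), or to allow combining marks after the letter with \p{M}*. A minimal sketch, with the object and method names (CombiningMarks, matchesAfterNfc, matchesWithMarks) chosen only for this example:

```scala
import java.text.Normalizer
import java.util.regex.Pattern

object CombiningMarks {
  private val singleLetter = Pattern.compile("\\p{L}")
  // A base letter followed by any number of combining marks.
  private val letterWithMarks = Pattern.compile("\\p{L}\\p{M}*")

  // Compose "a" + U+0308 into the single code point U+00E4 before matching.
  def matchesAfterNfc(s: String): Boolean =
    singleLetter.matcher(Normalizer.normalize(s, Normalizer.Form.NFC)).matches()

  // Match the decomposed form directly, without normalizing.
  def matchesWithMarks(s: String): Boolean =
    letterWithMarks.matcher(s).matches()
}

println(CombiningMarks.matchesAfterNfc("a\u0308"))  // true
println(CombiningMarks.matchesWithMarks("a\u0308")) // true
```

Normalizing is probably the better choice if you also compare or store the strings afterwards; the \p{M}* variant leaves the input untouched.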
You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in the java.util.regex.Pattern documentation under UNICODE_CHARACTER_CLASS.
This means you can use character classes such as '\w' to match Unicode characters like this:
"(?U)\\w+".r.findFirstIn("pässi")
In the regexp above, the '(?U)' bit is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.
This flag is supported starting from Java 7.
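For comparison, here is a small sketch of '\w+' with and without the flag, plus the equivalent Pattern.compile call that passes the flag explicitly. The names (UnicodeWordDemo, asciiWord, unicodeWord, viaFlag) are just for this example, and the input is written with a Unicode escape so the result doesn't depend on the source encoding.

```scala
import java.util.regex.Pattern

object UnicodeWordDemo {
  // Default behaviour: \w only covers [a-zA-Z_0-9].
  val asciiWord = "\\w+".r
  // (?U) switches \w to its Unicode definition.
  val unicodeWord = "(?U)\\w+".r
  // Same flag, passed to Pattern.compile instead of embedded in the pattern.
  val viaFlag = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
}

val word = "p\u00e4ssi" // "pässi"
println(UnicodeWordDemo.asciiWord.findFirstIn(word))     // Some(p)
println(UnicodeWordDemo.unicodeWord.findFirstIn(word))   // Some(pässi)
println(UnicodeWordDemo.viaFlag.matcher(word).matches()) // true
```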