I want to match words made up of Unicode letters (<code>\p{L}</code>). In the Scala REPL the following expression returns <code>false</code>, while in Java it returns <code>true</code> (which is the correct behaviour):

<code>java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()</code>

Both Java and Scala are running on JRE 1.7: <code>System.getProperty("java.version")</code> returns <code>"1.7.0_60-ea"</code>.

What could be the reason for that?


Unicode Regex in Scala REPL

2 Answers

This is probably caused by an incompatible character encoding being used within the interpreter. For example, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, this (somewhat old) blog post:

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala

[...] Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
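To make the encoding explanation more concrete, here is a minimal sketch of one way a mismatch could turn "ä" into something that no longer matches. It assumes the terminal sends UTF-8 bytes and the JVM decodes them as ISO-8859-1; the question doesn't say which encoding the REPL was actually using, so treat this as an illustration only. The names EncodingMismatchDemo, isSingleLetter and misdecode are made up for this example.

```scala
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

object EncodingMismatchDemo {
  private val letter = Pattern.compile("\\p{L}")

  // True only if the whole string is exactly one Unicode letter.
  def isSingleLetter(s: String): Boolean = letter.matcher(s).matches()

  // Assumption for illustration: the input arrives as UTF-8 bytes,
  // but is decoded as ISO-8859-1 (one char per byte).
  def misdecode(s: String): String =
    new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1)

  def main(args: Array[String]): Unit = {
    val original = "\u00e4"           // precomposed "ä", written as an escape
    val garbled  = misdecode(original)

    println(isSingleLetter(original)) // true
    println(garbled.length)           // 2: the two UTF-8 bytes became two chars
    println(isSingleLetter(garbled))  // false
  }
}
```

If something like this is happening, checking System.getProperty("file.encoding") inside the REPL, as shown above, should reveal a value other than UTF-8.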

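This wasn't the case here, but it can also happen: your "ä" could be an "a" followed by a combining diaeresis (umlaut) character, U+0308, rather than the single precomposed character. The string then contains two code points, and the combining mark is not in category L, so \p{L} cannot match the whole string, e.g.: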

scala> println("a\u0308")
ä

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false

This can be a problem on systems that produce diacritics as Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.
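If you do run into the combining-character case, two possible workarounds (not part of the original answer, just a sketch) are to normalize the input to NFC with java.text.Normalizer before matching, or to allow trailing combining marks in the pattern with \p{M}*. The object and method names below are hypothetical.

```scala
import java.text.Normalizer
import java.util.regex.Pattern

object CombiningMarkDemo {
  private val letter          = Pattern.compile("\\p{L}")
  private val letterWithMarks = Pattern.compile("\\p{L}\\p{M}*")

  // Compose base letter + combining mark into a single code point where one exists.
  def toNfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  def isSingleLetter(s: String): Boolean = letter.matcher(s).matches()

  // Accepts one letter followed by any number of combining marks.
  def isLetterWithMarks(s: String): Boolean = letterWithMarks.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    val decomposed = "a\u0308"                 // "a" + combining diaeresis

    println(isSingleLetter(decomposed))        // false
    println(isSingleLetter(toNfc(decomposed))) // true
    println(isLetterWithMarks(decomposed))     // true
  }
}
```

Normalizing is probably the simpler choice when you control the input; the \p{M}* variant is useful when you want to leave the text untouched.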

answered Sep 29 '22 10:09

mikołak

You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in the documentation for java.util.regex.Pattern and its UNICODE_CHARACTER_CLASS flag.

This means you can use predefined character classes such as '\w' to match Unicode characters, like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above, the '(?U)' bit is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.

This flag is supported starting from Java 7.
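As a small sketch of the difference the flag makes, the following compares '\w+' with and without '(?U)', and also shows the equivalent Pattern.UNICODE_CHARACTER_CLASS constant passed to Pattern.compile. The object and helper names are invented for the example, and the "ä" is written as the escape \u00e4 so the result doesn't depend on the source file's encoding.

```scala
import java.util.regex.Pattern

object UnicodeClassDemo {
  // First match of the given regex in the input, if any.
  def firstWord(regex: String, input: String): Option[String] =
    regex.r.findFirstIn(input)

  // Same idea as (?U), but using the flag constant instead of the embedded flag.
  def isUnicodeWord(input: String): Boolean =
    Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(input).matches()

  def main(args: Array[String]): Unit = {
    val word = "p\u00e4ssi"                // "pässi"

    println(firstWord("\\w+", word))       // Some(p): ASCII-only \w stops at "ä"
    println(firstWord("(?U)\\w+", word))   // Some(pässi)
    println(isUnicodeWord(word))           // true
  }
}
```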

answered Sep 29 '22 09:09

marko


Tags:

java

regex

unicode

scala

read-eval-print-loop

pvorb
