
Pattern matching with Chinese characters (encoded in UTF-8) in Java

Tags:

java

string

cjk

I need to check whether a Chinese province is contained within an address in Chinese.

I am able to read and write Chinese characters easily.

I tried to use the indexOf() method of String to check whether a province (e.g. 广东) is contained within an address (中国 广东). However, this always returns -1.

When I try to check for numbers (e.g. whether 103 is contained within 9910399) it works fine.

Do I need to do something different to handle UTF-8 string matching? Thanks. Matt

Matt Smith asked Nov 04 '22 15:11


1 Answer

I have just tried your example, and indexOf() works fine for me, even though I do not have Chinese fonts on my system and the characters are not displayed correctly.
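For reference, here is a minimal sketch of that kind of check. It writes the two strings from the question as Unicode escapes, so the result does not depend on the encoding of the .java file. The class and method names are made up for illustration and are not from the question.

```java
// Illustrative sketch; ProvinceMatch and findProvince are hypothetical names.
class ProvinceMatch {
    // Unicode escapes are plain ASCII, so these literals compile to the same
    // strings whatever encoding javac assumes for the source file.
    static final String ADDRESS = "\u4E2D\u56FD \u5E7F\u4E1C"; // the address from the question
    static final String PROVINCE = "\u5E7F\u4E1C";             // the province from the question

    // Returns the index of the province within the address, or -1 if absent.
    static int findProvince(String address, String province) {
        return address.indexOf(province);
    }

    public static void main(String[] args) {
        System.out.println(findProvince(ADDRESS, PROVINCE)); // prints 3
    }
}
```

If this prints 3 on your machine but the same test with literal Chinese characters in the source prints -1, the strings themselves are fine and the problem is most likely how the source file or the input data is being decoded.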

So, check the encoding of your source files (*.java). For example, if you are using Eclipse, check it under Window/Preferences/General/Workspace/Text file encoding. I am using UTF-8.

The second thing is the encoding used by the Java compiler. In Eclipse you do not have to specify anything. For javac, I think you should explicitly set the encoding using -encoding; otherwise the default OS encoding will probably be used.
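To see why a wrong encoding can make indexOf() return -1, here is a rough simulation. It assumes one of the two strings (for example the province literal) was stored as UTF-8 but decoded as ISO-8859-1, while the other was decoded correctly. That scenario is a guess at what might be happening, not something confirmed by the question, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; EncodingMismatch and misdecode are hypothetical names.
class EncodingMismatch {
    static final String ADDRESS = "\u4E2D\u56FD \u5E7F\u4E1C"; // the address from the question
    static final String PROVINCE = "\u5E7F\u4E1C";             // the province from the question

    // Simulates text that was saved as UTF-8 but read back with another charset.
    static String misdecode(String s, Charset wrong) {
        return new String(s.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        String mangled = misdecode(PROVINCE, StandardCharsets.ISO_8859_1);

        // The charset the JVM uses when none is given explicitly.
        System.out.println(Charset.defaultCharset());

        System.out.println(mangled.length());          // prints 6: one char per UTF-8 byte
        System.out.println(ADDRESS.indexOf(mangled));  // prints -1: the mangled text never matches
        System.out.println(ADDRESS.indexOf(PROVINCE)); // prints 3: both sides decoded consistently
    }
}
```

When compiling from the command line, something like `javac -encoding UTF-8 YourClass.java` (with your own file name) tells the compiler how to decode the source file instead of relying on the platform default.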

Good luck.

AlexR answered Nov 12 '22 18:11