Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use regular expression to match ANY Chinese character in utf-8 encoding

For example, I want to match a string consisting of m to n Chinese characters, then I can use:

[single Chinese character regular expression]{m,n} 

Is there some regular expression of a single Chinese character, which could be any Chinese characters that exists?

like image 501
xiaohan2012 Avatar asked Mar 06 '12 00:03

xiaohan2012


People also ask

Can UTF-8 handle Chinese characters?

UTF-8 is a character encoding system. It lets you represent characters as ASCII text, while still allowing for international characters, such as Chinese characters. As of the mid 2020s, UTF-8 is one of the most popular encoding systems.

Does regex work for other languages?

Short answer: yes.

Is Simplified Chinese UTF-8?

Simplified Chinese in the Solaris 8 environment provides three locales: zh, zh. UTF-8, and zh. GBK.

What does '$' mean in regex?

Literal Characters and Sequences For instance, you might need to search for a dollar sign ("$") as part of a price list, or in a computer program as part of a variable name. Since the dollar sign is a metacharacter which means "end of line" in regex, you must escape it with a backslash to use it literally.


2 Answers

The regex to match a Chinese (well, CJK) character is

\p{script=Han} 

which can be appreviated to simply

\p{Han} 

This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.

like image 92
tchrist Avatar answered Sep 21 '22 17:09

tchrist


In Java,

\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3} 
like image 41
DayDayHappy Avatar answered Sep 20 '22 17:09

DayDayHappy