
How can I match a string with only Chinese letters using a regex?

I want a regex that matches only strings consisting entirely of Chinese characters, with no English letters or any other characters. [\u4e00-\u9fa5] doesn't work at all, and [^x00-xff] also matches punctuation and characters from other languages.

boost::wregex reg(L"\\w*");
bool b = boost::regex_match(L"我a", reg);    // expected to be false
b = boost::regex_match(L"我,", reg);         // expected to be false
b = boost::regex_match(L"我", reg);          // expected to be true
magicyang asked Mar 29 '13 07:03




2 Answers

When Boost.Regex is built with ICU support, it can use Unicode character classes. I think you're looking for the \p{Han} script property. Alternatively, the block U+4E00..U+9FFF is \p{InCJK_Unified_Ideographs}.
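For comparison, the block range mentioned above can be checked without any regex engine. This is only a minimal sketch: the function names are made up for illustration, the input is assumed to be already decoded to UTF-32, and it covers just the U+4E00..U+9FFF block, not the full \p{Han} script (which also includes code points outside that block).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `cp` lies in the CJK Unified Ideographs
// block, U+4E00..U+9FFF (the block named in the answer above).
bool in_cjk_unified_block(char32_t cp) {
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Hypothetical helper: true if `s` is non-empty and every code point
// of `s` is inside that block.
bool all_in_cjk_unified_block(const std::u32string& s) {
    if (s.empty()) return false;
    for (char32_t cp : s) {
        if (!in_cjk_unified_block(cp)) return false;
    }
    return true;
}
```

Calling all_in_cjk_unified_block(U"\u6211") checks the single character 我 from the question; appending an ASCII letter or a comma makes the check fail.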

MSalters answered Oct 15 '22 02:10


The following regex works fine with boost::regex_match, which succeeds only when the entire string matches:

boost::wregex reg(L"^[\u4e00-\u9fa5]+");
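To try the three cases from the question without a Boost installation, here is a rough equivalent that swaps in std::wregex from the standard library for boost::wregex. The helper name is_all_chinese is made up for illustration, and the range is the same U+4E00..U+9FA5 as above. Behaviour with wide-character ranges can depend on the standard library implementation, so treat this as a sketch rather than a guarantee.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true only if `s` is non-empty and every character
// lies in U+4E00..U+9FA5. std::regex_match must consume the whole string,
// so no ^ or $ anchors are needed.
bool is_all_chinese(const std::wstring& s) {
    // \u escapes keep the source file independent of its encoding.
    static const std::wregex re(L"[\u4e00-\u9fa5]+");
    return std::regex_match(s, re);
}
```

With this helper, is_all_chinese(L"\u6211") covers the "我" case from the question, while L"\u6211a" and L"\u6211," cover the two cases that are expected to be rejected.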
magicyang answered Oct 15 '22 01:10