GNU grep regex `[一-十]` (one to ten) does not match the Chinese character 四 (four)

Question

This command

$ echo '一二三四五六七八九十' | grep -oE '[一-十]'

outputs:

一
二
三
五
六
七
八
九
十

The regex [一-十] (one to ten) is expected to match the Chinese numerals. As the example shows, it matches every Chinese numeral from one to ten except 四 (four).

Why?

Is this a bug or a joke?

I suspect it might be a joke, because in Chinese '四' (four) sounds like '事' (thing). In fact, in some dialects of Chinese the two are pronounced identically. Thus '一二三五六七八九十' (one two three five six seven eight nine ten) implies '沒四' (no four), i.e. '沒事' (no thing).

BTW, the version of grep I use:

GNU grep 2.5.4

bmargulies · Accepted Answer

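The Chinese numerals are not in numeric order in Unicode. 四 (four) is U+56DB, while 一 (one) is U+4E00 and 十 (ten) is U+5341. Interpreted by code point, the range [一-十] covers U+4E00 through U+5341, so 四 falls outside it and is not matched.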

Read the Unicode standard for more information, and see http://www.unicode.org/charts/PDF/U4E00.pdf.
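To make the code point argument concrete, here is a small Python sketch (Python rather than grep, because Python's `re` module always compares bracket ranges by code point, whereas grep's range handling can depend on the locale). The variable names are illustrative only, and the assumption is that grep compared by code point in the case above, which is consistent with the output shown in the question.

```python
import re

numerals = "一二三四五六七八九十"

# Show where each numeral actually sits in Unicode.
for ch in numerals:
    print(f"{ch} U+{ord(ch):04X}")

# The range [一-十] spans these two code points, inclusive.
lo, hi = ord("一"), ord("十")

# Numerals whose code point lies outside that span.
outside = [ch for ch in numerals if not lo <= ord(ch) <= hi]

# A range misses anything outside the span; an explicit set does not
# depend on code point order at all.
range_matches = re.findall("[一-十]", numerals)
explicit_matches = re.findall("[一二三四五六七八九十]", numerals)

print("outside the range:", outside)
print("range matches:", "".join(range_matches))
print("explicit set matches:", "".join(explicit_matches))
```

Listing the characters explicitly, as in `[一二三四五六七八九十]`, is the likely fix for the original grep command as well, since it avoids relying on how the numerals happen to be ordered in Unicode.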

Tags:

regex

grep

cjk

weakish

