
GNU grep regex `[一-十]` (one to ten) does not match the Chinese character 四 (four)

Tags: regex, grep, cjk

This command

$ echo '一二三四五六七八九十' | grep -oE '[一-十]'

outputs:

一
二
三
五
六
七
八
九
十

The regex [一-十] (one to ten) is expected to match the Chinese numerals. As the example shows, it matches every Chinese numeral from one to ten except the character 四 (four).

Why?

Is this a bug or a joke?

I could almost take this as a joke, because in Chinese '四' (four) sounds like '事' (thing). In fact, in some dialects of Chinese, the two share the same pronunciation. Thus '一二三五六七八九十' (one two three five six seven eight nine ten) implies '沒四' (no four), i.e. '沒事' (no thing).

BTW, the version of grep I use:

GNU grep 2.5.4
weakish asked Sep 29 '12 12:09


1 Answer

The Chinese numerals are not in numeric order in Unicode. The range [一-十] covers the code points from 一 (U+4E00) through 十 (U+5341), but 四 is U+56DB, which lies outside that range. So 四 doesn't match.

Read the Unicode standard for more information, and see http://www.unicode.org/charts/PDF/U4E00.pdf.
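
To see this concretely, here is a minimal Python sketch. It is not GNU grep itself: it assumes Python's `re` module, which compares bracket ranges by code point, and the variable names are only illustrative. It prints each numeral's code point, checks whether it falls between 一 and 十, and compares a range-based class with one that lists the characters explicitly.

```python
import re

numerals = "一二三四五六七八九十"

# Bounds of the range [一-十], as code points.
low, high = ord("一"), ord("十")

# Show each numeral's code point and whether it lies inside the range.
for ch in numerals:
    in_range = low <= ord(ch) <= high
    print(f"{ch} U+{ord(ch):04X} inside [一-十]: {in_range}")

# A range-based class depends on code point order and skips 四.
range_matches = re.findall("[一-十]", numerals)

# Listing the ten characters explicitly does not depend on their order.
explicit_matches = re.findall("[一二三四五六七八九十]", numerals)

print("range:   ", "".join(range_matches))
print("explicit:", "".join(explicit_matches))
```

四 is the only numeral reported as outside the range. Spelling out the ten characters in the class, rather than writing a range, is one way to avoid relying on code point order; the same explicit class should also work with grep in a UTF-8 locale, though that is not verified here against GNU grep 2.5.4.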

bmargulies answered Oct 30 '22 23:10