Using sed, how can a regular expression match Chinese characters?

Question

I am posting this question after spending quite some time without figuring out the problem. I have also read a number of seemingly related posts, but none really fits my simple (?) problem.

So I have a possibly large text file (>1000 lines) that contains Mandarin Chinese chars, with a sample line like:

"ref#2-5-1.jpg#2#一些 <variable> 内容#pic##" (the Chinese just means "some content").

All that needs to change is that a space should be inserted between adjacent Chinese characters, if there is not one already:

"ref#2-5-1.jpg#2#一 些 <variable> 内 容#pic##".

I started naively with straightforward stuff like the following, but there is no match at all:

sed -e 's/\([\u4E00-\u9fff]\)/\1 /g' <test_utf_sed.txt > test_out.txt

where U+4E00 to U+9FFF is supposed to be the code point range for Chinese characters. Unsurprisingly, this did not work, so I also wanted to try

sed -e 's/\([一-龻]\)/hello/g' <test_utf_sed.txt > test_out.txt

This failed because my bash cannot display (?) the "一" character.

Then I ran some basic tests, which failed as well:

sed -e 's/\(\u4E00\)/hello/g' <test_utf_sed.txt > test_out.txt   # 一
sed -e 's/\(\u4E9B\)/hello/g' <test_utf_sed.txt > test_out.txt   # 些

The same happened with another notation for Unicode code points (found here on Stack Overflow):

sed -e 's/\(\u'U+4E00\)/hello/g' <test_utf_sed.txt > test_out.txt

1) As a tool for dealing with multibyte characters, is sed the right choice at all?

2) Is sed able to handle Unicode at all, or do I need a special switch?

3) I am not looking for a workaround like this:

step1: insert a space after each character
  // like 's/\(.\)/\1 /g'
step2: remove the space after each character that is not a Chinese character
  // like 's/\([a-zA-Z0-9]\) /\1/g'

I know how to do this, but it is inelegant and error-prone. This must be possible using UTF-8 in a regex in sed.

4) My environment is bash 3.2 on Mac OS X 10.6.8 (an oldish OS).

5) If you know of any pointers to open collections of regex one-liners for Chinese text or language processing, it would be great if you could share them.

Thanks a lot in advance, your help is much appreciated!

Evan · Accepted Answer

Perl has pretty good support for dealing with Unicode. That might be a better bet for your task than sed. This one-liner works like your first sed example:

perl -CIOED -p -e 's/\p{Script_Extensions=Han}/$& /g' filename

The -CIOED switch tells perl to treat standard input, output, and error, as well as any files it opens, as UTF-8. -p runs the given code once for each line of the input file, then prints the result. -e specifies a line of Perl code to run. See the documentation on command-line arguments (perlrun) for more.

The regular expression uses a named Unicode property, \p{Script_Extensions=Han}, rather than an explicit code point range, to identify the characters to match.

You might also want to read the Perl Unicode documentation.

Tags: regex, bash, sed, utf-8, chinese-locale

sweetnsour
