
Using sed, how can a regular expression match Chinese characters?

I decided to post a question after spending quite some time without figuring out the problem. I also read a bunch of seemingly related posts, but none really fits my simple (?) problem.

So I have a possibly large text file (>1000 lines) that contains Mandarin Chinese chars, with a sample line like:

"ref#2-5-1.jpg#2#一些 <variable> 内容#pic##" (the Chinese just means "some content"). 

The only change needed is to insert a space between adjacent Chinese characters, if there is not one already:

"ref#2-5-1.jpg#2#一 些 <variable> 内 容#pic##".

I started naively with straightforward stuff like the following, but there is no match at all:

sed -e 's/\([\u4E00-\u9fff]\)/\1 /g' <test_utf_sed.txt > test_out.txt

where 4E00-9FFF is supposed to be the code point range for Mandarin Chinese characters. Unsurprisingly, this did not work, so I also wanted to try

sed -e 's/\([一-龻]\)/hello/g' <test_utf_sed.txt > test_out.txt

This failed because my bash cannot display (?) the "一" character.

Then I did some basic tests, which failed as well:

sed -e 's/\(\u4E00\)/hello/g' <test_utf_sed.txt > test_out.txt   # 一
sed -e 's/\(\u4E9B\)/hello/g' <test_utf_sed.txt > test_out.txt   # 些

The same happens with another notation for Unicode code points (found here on Stack Overflow):

sed -e 's/\(\u'U+4E00\)/hello/g' <test_utf_sed.txt > test_out.txt

1) Is sed the right tool at all for dealing with multibyte characters?

2) Is sed able to handle Unicode at all, or do I need a special switch?

3) I am not looking for a workaround solution like this:

step1: insert a space after each character
  (like 's/\(.\)/\1 /g')
step2: remove the space after each character that is not a Chinese character
  (like 's/\([a-zA-Z0-9]\) /\1/g')

I know how to do this, but it is inelegant and error-prone. This must be possible using UTF-8 in a sed regex.

4) My environment is bash-3.2 on Mac OS X 10.6.8 (an oldish OS).

5) If you know of any openly available collections of regex one-liners for Chinese text or language processing, pointers would be great.

Thanks a lot in advance, your help is much appreciated!

sweetnsour asked Feb 14 '23 04:02



1 Answer

Perl has pretty good support for dealing with Unicode. That might be a better bet for your task than sed. This one-liner works like your first sed example:

perl -CIOED -p -e 's/\p{Script_Extensions=Han}/$& /g' filename

The -CIOED switch tells perl to treat standard input, output, and error, as well as the files it opens, as UTF-8. -p runs the given code once for each line of the input file, then prints the result. -e specifies a line of Perl code to run. See the documentation on command-line arguments (perlrun) for more.

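The regular expression uses a Unicode property, \p{Script_Extensions=Han}, rather than an explicit code point range: it matches any character that Unicode assigns to the Han script.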

You might also want to read the Perl Unicode documentation.

Evan answered Apr 28 '23 00:04
