 

How to insert a space between Chinese character and English character?

Tags: regex, raku

I have a string where Chinese characters and English characters are next to each other:

我Love Perl 6哈哈

I want to insert a space wherever a Chinese character and an English character are adjacent:

我 Love Perl 6 哈哈

I found that the range \u4e00-\u9fa5 represents Chinese characters, so I tried:

'哈' ~~ /<[\u4e00..\u9fa5]>/

but this results in:

Potential difficulties:
Repeated character (0) unexpectedly found in character class
at line 2
------> '哈' ~~ /<[\u4e00..\⏏u9fa5]>/

So how do I match a Chinese character?

asked Jul 11 '18 13:07 by chenyf


1 Answer

The main problem is that \u is not a valid escape in Raku.

> "\u4e00"
===SORRY!=== Error while compiling:
Unrecognized backslash sequence: '\u'
------> "\⏏u4e00"

\x is valid, though.

> "\x4e00"
一

At any rate, the character class you are trying to use doesn't cover all Chinese characters.

> '㒠' ~~ /<[\x4e00..\x9fa5]>/
Nil
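
To see why the fixed range misses this character, it can help to print the code point next to a range check. The following is a minimal sketch; the helper name in-basic-range is made up for illustration and is not part of the answer.

```raku
# Hypothetical helper: is the code point inside the range from the question?
sub in-basic-range(Str $char --> Bool) {
    so $char.ord ~~ (0x4e00 .. 0x9fa5)
}

# Print each character, its code point in hex, and the result of the range check.
for '哈', '㒠' -> $char {
    say sprintf('%s U+%04X in range: %s', $char, $char.ord, in-basic-range($char));
}
```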

What you probably want is to match on a script.

> '㒠' ~~ /<:Han>/
「㒠」

This has the benefit that you don't have to keep changing your character class every time a new set of characters gets added to Unicode.
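
If you are unsure which script a character belongs to, you can ask for its Unicode Script property. This is a small sketch; script-of is a made-up helper name. Digits such as '6' belong to the Common script rather than Latin, which matters when deciding what to match next to a Han character.

```raku
# Hypothetical helper: return the Unicode Script property of a character.
sub script-of(Str $char) {
    $char.uniprop('Script')
}

# Inspect the kinds of characters that appear in the example string.
for '我', '㒠', 'L', '6' -> $char {
    say "$char: ", script-of($char);
}
```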


With that, you could do any of the following:

# store in $0 and $1
say S/(<:Han>)(<:Latin>)/$0 $1/ given '我Love Perl 6哈哈'
say S{(<:Han>)(<:Latin>)} = "$0 $1" given '我Love Perl 6哈哈'
# same with subst
say '我Love Perl 6哈哈'.subst: /(<:Han>)(<:Latin>)/, {"$0 $1"}

# only match between the two
say S/<:Han> <( )> <:Latin>/ / given '我Love Perl 6哈哈'
say S{<:Han> <( )> <:Latin>} = ' ' given '我Love Perl 6哈哈'

To change the value in a variable, use s/// or .=subst (each line below is an independent alternative):

my $v = '我Love Perl 6哈哈';

$v ~~ s/(<:Han>)(<:Latin>)/$0 $1/;
$v ~~ s{(<:Han>)(<:Latin>)} = "$0 $1";
$v ~~ s/<:Han> <()> <:Latin>/ /;

$v .= subst: /(<:Han>)(<:Latin>)/, {"$0 $1"};
$v .= subst: /<:Han> <()> <:Latin>/,' ';

Note that <( excludes everything matched before it from the overall match, and )> does the same for everything matched after it. (They can be used independently of each other.)
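
As a small illustration of the capture markers, the sketch below requires a Han character in front of the Latin run but leaves it out of the resulting match. The variable name $m is only for illustration.

```raku
# The Han character must be present, but <( starts the reported match after it.
my $m = '我Love' ~~ / <:Han> <( <:Latin>+ )> /;

say $m.Str;   # the matched text, without the leading Han character
say $m.from;  # the position where the reported match starts
```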

You may want to use a negated match for the following character instead.

S/<:Han> <( )> [ <!:Han> & <!space> ]/ /

(This matches where the following character is at the same time not Han and not whitespace.)
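
Without the :g adverb, a substitution replaces only the first match, and <:Latin> does not match the digit in '6哈'. To get the exact output asked for in the question, one option is to run a global substitution in each direction. This is a sketch under assumptions, not the only way to do it: the helper name space-cjk is made up, and treating decimal digits (<:Nd>) like Latin letters is a choice made here to fit the example string.

```raku
# Hypothetical helper: put a space between Han and Latin/digit neighbours.
sub space-cjk(Str $text --> Str) {
    $text
        # Han followed by a Latin letter or decimal digit
        .subst(/ (<:Han>) (<:Latin + :Nd>) /, -> $m { $m[0] ~ ' ' ~ $m[1] }, :g)
        # Latin letter or decimal digit followed by Han
        .subst(/ (<:Latin + :Nd>) (<:Han>) /, -> $m { $m[0] ~ ' ' ~ $m[1] }, :g)
}

say space-cjk('我Love Perl 6哈哈');
```

Two passes are used so that a character with neighbours on both sides, as in 'a我b', gets a space on each side.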

answered Sep 28 '22 05:09 by Brad Gilbert