How to insert a space between Chinese characters and English characters?

Tags:

regex

raku

I have a string in which Chinese and English characters are directly next to each other:

我Love Perl 6哈哈

I want to insert a space wherever a Chinese character meets an English character:

我 Love Perl 6 哈哈

I read that the range \u4e00-\u9fa5 represents Chinese characters, so I tried:

'哈' ~~ /<[\u4e00..\u9fa5]>/

but this results in:

Potential difficulties:
Repeated character (0) unexpectedly found in character class
at line 2
------> '哈' ~~ /<[\u4e00..\⏏u9fa5]>/

So how do I match a Chinese character?

asked Jul 11 '18 13:07

chenyf

1 Answer

The main problem is that \u is not a valid escape.

> "\u4e00"
===SORRY!=== Error while compiling:
Unrecognized backslash sequence: '\u'
------> "\⏏u4e00"

\x is valid, though.

> "\x4e00"
一

At any rate, the character class you are trying to use doesn't cover all Chinese characters.

> '㒠' ~~  /<[\x4e00..\x9fa5]>/ 
Nil

What you probably want is to match on a script.

> '㒠' ~~  /<:Han>/
｢㒠｣

This has the benefit that you don't have to keep changing your character class every time a new set of characters gets added to Unicode.

With a script match, you could do any of the following:

# store in $0 and $1
say S/(<:Han>)(<:Latin>)/$0 $1/ given '我Love Perl 6哈哈';
say S{(<:Han>)(<:Latin>)} = "$0 $1" given '我Love Perl 6哈哈';
# same with subst
say '我Love Perl 6哈哈'.subst: /(<:Han>)(<:Latin>)/, {"$0 $1"};

# only match between the two
say S/<:Han> <( )> <:Latin>/ / given '我Love Perl 6哈哈';
say S{<:Han> <( )> <:Latin>} = ' ' given '我Love Perl 6哈哈';

To change the value in a variable, use s/// or .= subst:

my $v = '我Love Perl 6哈哈';

# each statement below is an alternative; any one of them is enough
$v ~~ s/(<:Han>)(<:Latin>)/$0 $1/;
$v ~~ s{(<:Han>)(<:Latin>)} = "$0 $1";
$v ~~ s/<:Han> <( )> <:Latin>/ /;

$v .= subst: /(<:Han>)(<:Latin>)/, {"$0 $1"};
$v .= subst: /<:Han> <( )> <:Latin>/, ' ';

Note that <( excludes everything before it from the resulting match, and )> does the same for everything after it. Each marker can also be used on its own.
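
As a small sketch of how these markers change what the match reports (the variable names below are illustrative and not part of the original answer):

```raku
my $str = '我Love';

# Without markers, the match covers both characters.
my $whole = $str ~~ / <:Han> <:Latin> /;
say ~$whole;                  # 我L

# With both markers and nothing between them, the match is the empty
# position between the two characters; that position is what s/// replaces.
my $gap = $str ~~ / <:Han> <( )> <:Latin> /;
say $gap.from, ' ', $gap.to;  # 1 1

# Using only <( keeps everything from the marker to the end of the pattern.
my $tail = $str ~~ / <:Han> <( <:Latin>+ /;
say ~$tail;                   # Love
```

Because the surrounding characters are still required for the pattern to succeed, the markers behave much like a lookbehind and a lookahead here.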

Since <:Latin> only matches Latin letters, you may want to use a negated match for the following character instead.

S/<:Han> <( )> [ <!:Han> & <!space> ]/ /

(The character that follows must be at the same time not Han and not a space.)
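
The snippets above each replace only the first match, and <:Latin> does not match the digit in '6哈哈', so a single substitution does not produce the full target string from the question. A minimal sketch that handles both directions globally might look like the following. The helper name space-out and the decision to treat decimal digits (<:Nd>) like Latin letters are assumptions made for illustration, not part of the answer:

```raku
# space-out is a hypothetical helper name used only for this sketch.
sub space-out(Str $text --> Str) {
    # Pass 1: a Han character followed by a Latin letter or a decimal digit.
    my $step = $text.subst(
        / (<:Han>) (<:Latin + :Nd>) /,
        -> $m { "$m[0] $m[1]" },
        :g,
    );
    # Pass 2: the mirror case, a Latin letter or digit followed by Han.
    $step.subst(
        / (<:Latin + :Nd>) (<:Han>) /,
        -> $m { "$m[0] $m[1]" },
        :g,
    )
}

say space-out('我Love Perl 6哈哈');   # 我 Love Perl 6 哈哈
```

Running the two directions as separate passes keeps each pattern simple. If other characters should also be separated from Han text, the <:Latin + :Nd> class is the place to widen.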

answered Sep 28 '22 05:09

Brad Gilbert
