I have the string "re\x{0301}sume\x{0301}"
(which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r"
(émusér). I can't use Perl's reverse
because it treats combining characters like "\x{0301}"
as separate characters, so I wind up getting "\x{0301}emus\x{0301}er"
( ́emuśer). How can I reverse the string, but still respect the combining characters?
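You can use the \X special escape (it matches a grapheme cluster: a non-combining character and all of the combining characters that follow it) with split
to make a list of graphemes. Because the pattern is in capturing parentheses, split returns the graphemes themselves, with empty strings between them that do no harm here. Reverse the list of graphemes, then join
them back together: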
#!/usr/bin/perl
use strict;
use warnings;
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";
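If the empty strings that split leaves between the graphemes bother you, a global match in list context should give the same result without them. This is only a sketch of that variation: reverse_graphemes is a made-up helper name, and the binmode line is there just to keep print from warning about wide characters.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# reverse_graphemes is a hypothetical helper name, not a built-in.
sub reverse_graphemes {
    my ($string) = @_;

    # In list context, /\X/g returns every grapheme cluster in order,
    # with no empty strings in between.
    my @graphemes = $string =~ /\X/g;

    return join '', reverse @graphemes;
}

my $original = "re\x{0301}sume\x{0301}";

print "original: $original\n";
print "reversed: ", reverse_graphemes($original), "\n";
```

The helper takes and returns ordinary character strings, so it can be dropped in wherever the plain reverse was used.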
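The best answer is to use Unicode::GCString, as Sinan points out.
I modified Chas's example a bit: it sets the encoding on STDOUT and prints each string in brackets. (I also tried a change to the split, but it doesn't work after 5.10, apparently, so I removed it.) It's basically the same thing with a couple of tweaks.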
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print <<HERE;
original: [$original]
wrong: [$wrong]
right: [$right]
HERE
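To see why the plain reverse goes wrong, it can help to compare what length counts with what \X matches. The following is a small illustrative sketch, not part of the original example; code_points is a hypothetical helper that just formats a cluster as its code points.

```perl
use strict;
use warnings;

my $original = "re\x{0301}sume\x{0301}";

# length counts code points, so each combining accent is counted on its own.
my $code_points = length $original;

# Counting \X matches gives the number of grapheme clusters instead.
my $graphemes = () = $original =~ /\X/g;

# code_points is a hypothetical helper: it renders one cluster as U+XXXX values.
sub code_points {
    my ($cluster) = @_;
    return join '+', map { sprintf('U+%04X', ord $_) } split //, $cluster;
}

my @clusters = map { code_points($_) } $original =~ /\X/g;

print "$code_points code points, $graphemes graphemes\n";
print join(' ', @clusters), "\n";
```

Each accented letter shows up as a single cluster made of two code points (U+0065+U+0301), which is the unit the reversal has to keep together.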
You can use Unicode::GCString:
Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);
use Unicode::GCString;
my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };
say "$x -> $wrong";
say "$y -> $correct";
Output:
résumé -> ́emuśer
résumé -> émusér