
How can I reverse a string that contains combining characters in Perl?

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?

Chas. Owens asked Aug 28 '09 14:08


3 Answers

You can use the \X escape (which matches a non-combining character together with all of the combining characters that follow it) with split to make a list of graphemes. Because the pattern is in capturing parentheses, split returns each grapheme as a separator, with empty strings between them. Reverse that list, then join it back together:

#!/usr/bin/perl

use strict;
use warnings;

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;
print "original: $original\n",
      "wrong:    $wrong\n",
      "right:    $right\n";
Chas. Owens answered Nov 06 '22 07:11


The best answer is to use Unicode::GCString, as Sinan points out.


I modified Chas's example a bit:

  • Set the encoding on STDOUT to avoid "wide character in print" warnings;
  • Use a positive lookahead assertion (and no separator retention mode) in split. This apparently doesn't work after 5.10, so I removed it.

It's basically the same thing with a couple of tweaks.

use strict;
use warnings;

binmode STDOUT, ":utf8";

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;

print <<HERE;
original: [$original]
   wrong: [$wrong]
   right: [$right]
HERE
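
The split /(\X)/ approach above leaves empty strings in the list, which join then silently discards. As a minimal sketch of an alternative, matching /\X/g in list context returns only the graphemes. The helper name reverse_graphemes is hypothetical and not part of either answer; the sketch also prints the code point and grapheme counts to show why a plain reverse goes wrong.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper (not from the answers above): reverse a string
# one grapheme cluster at a time instead of one code point at a time.
sub reverse_graphemes {
    my ($string) = @_;

    # In list context, /\X/g returns every grapheme cluster in order,
    # with no empty strings in between.
    my @graphemes = $string =~ /\X/g;

    return join '', reverse @graphemes;
}

my $original  = "re\x{0301}sume\x{0301}";
my $reversed  = reverse_graphemes($original);
my @graphemes = $original =~ /\X/g;

# Encode output as UTF-8 to avoid "Wide character in print" warnings.
binmode STDOUT, ':encoding(UTF-8)';

printf "code points: %d, graphemes: %d\n",
    length($original), scalar(@graphemes);
print "reversed:    $reversed\n";
```

For this input the string has 8 code points but only 6 graphemes, which is why reversing by code point detaches each accent from its base letter.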
brian d foy answered Nov 06 '22 05:11


You can use Unicode::GCString:

Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].

#!/usr/bin/env perl

use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);

use Unicode::GCString;

my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };

say "$x -> $wrong";
say "$y -> $correct";

Output:

résumé -> ́emuśer
résumé -> émusér
Sinan Ünür answered Nov 06 '22 06:11