 

Perl: trim UTF-8 bytes to a given length and sanitize the data

Tags: utf-8, perl

I have a UTF-8 sequence of bytes and need to trim it to, say, 30 bytes. This may leave an incomplete sequence at the end. I need to figure out how to remove that incomplete sequence.

E.g.:

use Encode;

$b = "\x{263a}\x{263b}\x{263c}";
my $sstr;

print STDERR "length in utf8 bytes = " . length(Encode::encode_utf8($b)) . "\n";
{
    use bytes;
    # Cut at 4 bytes here so that this 9-byte example ends mid-character.
    $sstr = substr($b, 0, 4);
}

# After this, $sstr contains "\342\230\272\342"
# How to remove \342 from the end?
kaykay asked Dec 27 '22 23:12


1 Answer

UTF-8 has some neat properties that allow us to do what you want while dealing with UTF-8 rather than characters. So first, you need UTF-8.

use Encode qw( encode_utf8 );
my $bytes = encode_utf8($str);

Now, to split between code points. The UTF-8 encoding of every code point starts with a byte matching 0b0xxxxxxx or 0b11xxxxxx, and those bytes never appear in the middle of a code point (continuation bytes all match 0b10xxxxxx). That means you want to truncate before

[\x00-\x7F\xC0-\xFF]
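To make the start-byte rule concrete, here is a minimal sketch that labels each byte of an encoded string. The helper name is_start_byte is made up for illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical helper: true if the byte can start a UTF-8 sequence,
# i.e. it is not a continuation byte (0b10xxxxxx).
sub is_start_byte {
    my ($byte) = @_;
    return $byte =~ /^[\x00-\x7F\xC0-\xFF]\z/ ? 1 : 0;
}

my $bytes = encode_utf8("a\x{263a}");   # bytes: 61 E2 98 BA

# Print each byte in hex along with its role in the encoding.
for my $byte (split //, $bytes) {
    printf "%02X %s\n", ord($byte),
        is_start_byte($byte) ? "start" : "continuation";
}
```

Only 61 and E2 are reported as start bytes, so those are the only offsets where it is safe to cut.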

Together, we get:

use Encode qw( encode_utf8 );

my $max_bytes = 8;
my $str = "\x{263a}\x{263b}\x{263c}";  # ☺☻☼

my $bytes = encode_utf8($str);
$bytes =~ s/^.{0,$max_bytes}(?![^\x00-\x7F\xC0-\xFF])\K.*//s;

# $bytes contains encode_utf8("\x{263a}\x{263b}")
#      instead of encode_utf8("\x{263a}\x{263b}") . "\xE2\x98"
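If you'd rather avoid the regex, the same idea can be sketched with substr by backing the cut position up until it sits on a start byte. The sub name truncate_utf8_bytes is hypothetical, and the sketch assumes the input is valid UTF-8 and $max_bytes is not negative.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical helper (not from Encode): cut a UTF-8 byte string to at
# most $max_bytes bytes without leaving a partial sequence at the end.
# Assumes valid UTF-8 input and a non-negative $max_bytes.
sub truncate_utf8_bytes {
    my ($bytes, $max_bytes) = @_;
    return $bytes if length($bytes) <= $max_bytes;

    # The byte at offset $cut is the first one dropped. If it is a
    # continuation byte (0x80-0xBF), the cut is inside a code point,
    # so move it back until it lands on a start byte.
    my $cut = $max_bytes;
    $cut-- while $cut > 0 && substr($bytes, $cut, 1) =~ /[\x80-\xBF]/;

    return substr($bytes, 0, $cut);
}

my $bytes = encode_utf8("\x{263a}\x{263b}\x{263c}");   # 9 bytes
my $short = truncate_utf8_bytes($bytes, 4);

# $short is "\xE2\x98\xBA": the stray \xE2 from the question is gone.
print join(' ', map { sprintf '%02X', ord } split //, $short), "\n";
```

Like the regex, this only guarantees that the cut falls between code points.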

Great, yes? Nope. The above can truncate in the middle of a grapheme. A grapheme (specifically, an "extended grapheme cluster") is what someone would perceive as a single visual unit. For example, "é" is a grapheme, but it can be encoded using two code points ("\x{0065}\x{0301}"). If you cut between the two code points, the result would be valid UTF-8, but the "é" would become an "e"! If that's not acceptable, neither is the above solution. (Oleg's solution suffers from the same problem.)
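A quick way to see the difference between code points and graphemes is to count both for the same string. This is only an illustrative sketch; the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# "abcdefg" with an e-acute written as e + combining acute accent
my $str = "abcd\x{0065}\x{0301}fg";

my $code_points = length($str);          # counts code points
my $graphemes   = () = $str =~ /\X/g;    # counts extended grapheme clusters
my $e_acute_len = length(encode_utf8("\x{0065}\x{0301}"));   # bytes in the cluster

print "code points: $code_points, graphemes: $graphemes\n";
print "bytes in the e-acute cluster: $e_acute_len\n";
```

The string has 8 code points but only 7 graphemes, and the accented cluster takes 3 bytes (1 for the "e", 2 for the combining accent), which is why a byte limit can fall inside it.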

Unfortunately, UTF-8's properties are no longer sufficient to help us here. We'll need to grab one grapheme at a time, and add it to the output until we can't fit one.

use Encode qw( encode_utf8 );

my $max_bytes = 6;
my $str = "abcd\x{0065}\x{0301}fg";  # abcdéfg

my $bytes = '';
my $bytes_left = $max_bytes;
# \X matches one extended grapheme cluster at a time.
while ($str =~ /(\X)/g) {
   my $grapheme = $1;
   my $grapheme_bytes = encode_utf8($grapheme);
   $bytes_left -= length($grapheme_bytes);
   last if $bytes_left < 0;   # This grapheme doesn't fit; stop here.
   $bytes .= $grapheme_bytes;
}

# $bytes contains encode_utf8("abcd")
#      instead of encode_utf8("abcde")
#              or encode_utf8("abcde") . "\xCC"
ikegami answered Jan 05 '23 15:01