
Create an invalid UTF8 perl string?

Tags: unicode, perl

What's a good way to create a Perl string that has the UTF8 flag set but contains an invalid UTF-8 byte sequence?

Is there a way to set the UTF8 flag on a Perl string without performing the translation from the native encoding to UTF-X (which happens, for instance, when you call utf8::upgrade)?
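To make the translation step concrete, here is a minimal sketch of what utf8::upgrade does to a one-character string. It assumes a stock Perl with the core bytes module; the variable names are illustrative only and not part of the original question.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without switching on byte semantics

my $s = "\xC0";                         # one character, stored as one byte
my $chars_before = length $s;
my $bytes_before = bytes::length($s);

utf8::upgrade($s);                      # sets the UTF8 flag AND re-encodes the buffer

my $chars_after = length $s;
my $bytes_after = bytes::length($s);
my $flag_after  = utf8::is_utf8($s) ? 1 : 0;

printf "before: %d char, %d byte(s)\n", $chars_before, $bytes_before;
printf "after:  %d char, %d byte(s), UTF8 flag = %d\n",
    $chars_after, $bytes_after, $flag_after;
```

The character count stays at 1 while the internal byte count goes from 1 to 2, so the buffer is always valid after an upgrade. That is why utf8::upgrade alone cannot produce the malformed string asked about here.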

I need to do this to track down a possible bug in the DBI driver.

asked May 09 '13 17:05 by ErikR



2 Answers

You can store an arbitrary sequence of bytes in a string, with the UTF8 flag still set, by hacking at the guts of the string from C.

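use Inline C;                  # compiles the C code found after __C__
use Devel::Peek;               # provides Dump()
utf8::upgrade( $str = "" );    # empty string, but the UTF8 flag is now set
Dump($str);                    # before: flagged and empty
twiddle($str, "\x{BD}\x{BE}\x{BF}\x{C0}\x{C1}\x{C2}");  # append six raw bytes
Dump($str);                    # after: still flagged, buffer is malformed
__DATA__
__C__
/** append arbitrary bytes to a Perl scalar **/
void twiddle(SV *s, const char *t)
{
  /* sv_catpv copies the bytes in as-is and leaves the UTF8 flag alone */
  sv_catpv(s, t);
}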

Typical output:

SV = PV(0x80029bb0) at 0x80072008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80155098 ""\0 [UTF8 ""]
  CUR = 0
  LEN = 12
SV = PV(0x80029bb0) at 0x80072008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80155098 "\275\276\277\300\301\302"\0Malformed UTF-8 character (unexpected continuation byte 0xbd, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected continuation byte 0xbe, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected non-continuation byte 0xc1, immediately after start byte 0xc0) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at ./invalidUTF.pl line 6.
 [UTF8 "\x{0}\x{0}\x{0}\x{0}\x{0}"]
  CUR = 6
  LEN = 12
answered Sep 18 '22 23:09 by mob


That's exactly what Encode's _utf8_on does.

use Encode qw( _utf8_on );

my $s = "abc\xC0def";  # String to use as raw buffer content.
utf8::downgrade($s);   # Make sure each char is stored as a byte.
_utf8_on($s);          # Set UTF8 flag.

(Never use _utf8_on except when you want to generate a bad scalar.)

You can view the damage using

use Devel::Peek qw( Dump );
Dump($s);

Output:

SV = PV(0x24899c) at 0x4a9294
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x24ab04 "abc\300def"\0Malformed UTF-8 character (unexpected non-continuation byte 0x64, immediately after start byte 0xc0) in subroutine entry at script.pl line 9.
 [UTF8 "abc\x{0}ef"]
  CUR = 7
  LEN = 12
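
If you only need a yes/no answer rather than a full dump, the built-in utf8::is_utf8 and utf8::valid functions can confirm that the scalar is flagged but malformed. This is a minimal sketch that repeats the construction above; the $flagged and $valid names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( _utf8_on );

my $s = "abc\xC0def";   # \xC0 is a start byte with no continuation byte after it
utf8::downgrade($s);    # make sure each char is stored as a byte
_utf8_on($s);           # set the UTF8 flag without re-encoding the buffer

my $flagged = utf8::is_utf8($s) ? 1 : 0;   # is the UTF8 flag on?
my $valid   = utf8::valid($s)   ? 1 : 0;   # is the buffer consistent with the flag?

print "UTF8 flag set: $flagged\n";   # prints "UTF8 flag set: 1"
print "well-formed:   $valid\n";     # prints "well-formed:   0"
```

A check like this could also be run on values just before they are handed to the DBI driver, to tell apart bad input from a driver bug.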
answered Sep 20 '22 23:09 by ikegami