What's a good way to create a perl string with the UTF8 flag set but contains an invalid UTF8 byte sequence? Is there a way to set the UTF8 flag on a perl string without performing the native encoding to UTF-X translation (for instance, which happens when you call <code>utf8::upgrade</code>)? I need to do this to track down a possible bug in the DBI driver.

That's exactly what Encode's <code>_utf8_on</code> does. <pre class="prettyprint"><code>use Encode qw( _utf8_on ); my $s = "abc\xC0def"; # String to use as raw buffer content. utf8::downgrade($s); # Make sure each char is stored as a byte. _utf8_on($s); # Set UTF8 flag. </code></pre> (Never use <code>_utf8_on</code> except when you want to generate a bad scalar.) You can view the damage using <pre class="prettyprint"><code>use Devel::Peek qw( Dump ); Dump($s); </code></pre> Output: <pre class="prettyprint"><code>SV = PV(0x24899c) at 0x4a9294 REFCNT = 1 FLAGS = (PADMY,POK,pPOK,UTF8) PV = 0x24ab04 "abc\300def"\0Malformed UTF-8 character (unexpected non-continuation byte 0x64, immediately after start byte 0xc0) in subroutine entry at script.pl line 9. [UTF8 "abc\x{0}ef"] CUR = 7 LEN = 12 </code></pre>

Create an invalid UTF8 perl string?

2 Answers

You can set an arbitrary sequence of bytes with the UTF8 flag still set by hacking at the guts of a string.

use Inline C;
use Devel::Peek;
utf8::upgrade( $str = "" );
Dump($str); 
twiddle($str, "\x{BD}\x{BE}\x{BF}\x{C0}\x{C1}\x{C2}");
Dump($str);
__DATA__
__C__
/** append arbitrary bytes to a Perl scalar **/
void twiddle(SV *s, const char *t)
{
  sv_catpv(s, t);
}

Typical output:

SV = PV(0x80029bb0) at 0x80072008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80155098 ""\0 [UTF8 ""]
  CUR = 0
  LEN = 12
SV = PV(0x80029bb0) at 0x80072008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80155098 "\275\276\277\300\301\302"\0Malformed UTF-8 character (unexpected continuation byte 0xbd, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected continuation byte 0xbe, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected non-continuation byte 0xc1, immediately after start byte 0xc0) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at ./invalidUTF.pl line 6.
 [UTF8 "\x{0}\x{0}\x{0}\x{0}\x{0}"]
  CUR = 6
  LEN = 12

184

answered Sep 18 '22 23:09

mob

That's exactly what Encode's _utf8_on does.

use Encode qw( _utf8_on );

my $s = "abc\xC0def";  # String to use as raw buffer content.
utf8::downgrade($s);   # Make sure each char is stored as a byte.
_utf8_on($s);          # Set UTF8 flag.

(Never use _utf8_on except when you want to generate a bad scalar.)

You can view the damage using

use Devel::Peek qw( Dump );
Dump($s);

Output:

SV = PV(0x24899c) at 0x4a9294
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x24ab04 "abc\300def"\0Malformed UTF-8 character (unexpected non-continuation byte 0x64, immediately after start byte 0xc0) in subroutine entry at script.pl line 9.
 [UTF8 "abc\x{0}ef"]
  CUR = 7
  LEN = 12

answered Sep 20 '22 23:09

ikegami

Related questions
                            
                                How can I create a Perl variable name based on a string?
                            
                                Why am I getting a syntax error when I pass a coderef to this prototyped Perl subroutine?
                            
                                Is there a better way to pass by reference in Perl?
                            
                                How can I get 100% test coverage in a Perl module that uses DBI?
                            
                                How do I install a CPAN module site-wide while local::lib is present?
                            
                                In Perl, how can I check if a file is locked?
                            
                                What is the equivalent of Perl's (<>) in Python? fileinput doesn't work as expected
                            
                                Perl: how to share the import of a large list of modules between many independent scripts?
                            
                                Perl's use encoding pragma breaking UTF strings
                            
                                How to remove invalid characters from an xml file using sed or Perl
                            
                                Find timezone from airport code using perl code
                            
                                How to avoid errors with perl command line parameters and use strict
                            
                                How should I pass objects to subroutines?
                            
                                What exactly is the difference between calling a method via "->" vs passing class/object as first parameter?
                            
                                How can I get perl grep do perform regex expression capturing
                            
                                perl: how to understand @$ref[0]?
                            
                                How do I turn the return value of a Perl sub into an arrayref?
                            
                                Template Toolkit, test for last iteration in a nested loop
                            
                                What does "my ($self, %myInputs) = @_;" mean?
                            
                                Installing perl dependency automatically in perl

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Create an invalid UTF8 perl string?

Tags:

unicode

perl

ErikR

People also ask

2 Answers

mob

ikegami

Recent Activity

Donate For Us