
Perl Unicode internals - mess with utf8

Tags: unicode, utf-8, perl

Before anyone tells me to RTFM, I must say that I have dug through:

Why does modern Perl avoid UTF-8 by default?
Checklist for going the Unicode way with Perl
How to match string with diacritic in perl?
How to make "use My::defaults" with modern perl & utf8 defaults?
and many others (like perluniintro), but I have surely missed something

So, the basic code:

use 5.014;           #getting 'unicode_strings' feature
use uni::perl;       #turning on many utf8 things
use Unicode::Normalize  qw(NFD NFC);
use warnings;
while(<>) {
    chomp;
    my $data = NFD($_);
    say "OK" if utf8::is_utf8($data);
}

At this point, from the UTF-8 encoded STDIN I get a correct Unicode string in $data, e.g. "\w" will match multibyte [\p{Alphabetic}\p{Decimal_Number}\p{Letter_Number}] (maybe something more). That's OK and works.

AFAIK $data does not contain utf8, but a string in perl's internal Unicode format.

Now the questions:

HOW can I ensure (test it) that any $other_data contains a valid Unicode string?
For what purpose is the utf8::is_utf8($data)? The whole utf8 pragma is a mystery for me.

I understand that use utf8; only tells Perl that my source code is in UTF-8 (much as when my script starts with a BOM, e.g. the big-endian one). From Perl's point of view, my source code is like an external file, and Perl should know what encoding it is in...

In the above example utf8::is_utf8($data) will print OK - but I don't understand WHY.

Internally Perl does not use utf8, so my utf8 data file is converted into Perl's internal Unicode. So why does utf8::is_utf8($data) return true for $data, which is not in utf8 format? Or is it misnamed, and the function should be called uni::is_unicode($data)?

Thanks in advance for clarification.

Ps: @brian d foy - yes, I still don't have the Effective Perl Programming book - I will get it - I promise :) /joking/


asked May 30 '12 21:05

cajwine

2 Answers

HOW can I ensure (test it) that any $other_data contains a valid Unicode string?

You cannot determine ex post facto whether a string has character semantics or byte semantics. Perl does not track this for you. You have to track it by careful programming: encode and decode at the boundaries; :raw layer for byte semantics, :encoding(foo) for character semantics. Employ naming conventions for your variables and functions to clearly differentiate between the semantics and make wrong code look wrong.
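As a minimal sketch of "decode on the way in, encode on the way out", assuming the input arrives as UTF-8 octets: the _octets and _chars suffixes below are only an illustrative naming convention, not something Perl enforces.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(decode encode);

# Octets as they might arrive from a socket or a :raw filehandle.
my $name_octets = "caf\xC3\xA9";                      # 5 bytes

# Boundary in: octets -> characters. LEAVE_SRC keeps the input scalar intact.
my $name_chars = decode('UTF-8', $name_octets,
                        Encode::FB_CROAK() | Encode::LEAVE_SRC());

# In the middle of the program, work on characters only.
my $char_count  = length $name_chars;                 # counts characters, not bytes
my $upper_chars = uc $name_chars;

# Boundary out: characters -> octets.
my $upper_octets = encode('UTF-8', $upper_chars);
```

With this convention, passing a variable named *_octets to something that expects *_chars (or the reverse) stands out when reading the code.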

for what purpose is the utf8::is_utf8($data)?

It tells you the presence of the SvUTF8 flag, nothing more. This is almost entirely useless for most developers, because it is an internals thing. The flag does not mean that a string has character semantics, its absence does not mean that a string has byte semantics.
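A small sketch of why the flag says nothing about semantics; the variable names and sample data are made up for illustration. Binary data can carry the flag, and text can lack it:

```perl
use strict;
use warnings;

# Binary data (the first bytes of a JPEG file), forced into the flagged format.
my $jpeg_magic = "\xFF\xD8\xFF\xE0";
utf8::upgrade($jpeg_magic);
my $flag_on_binary = utf8::is_utf8($jpeg_magic) ? 1 : 0;    # flag is set

# Text made of characters below U+0100, forced into the unflagged format.
my $word = "r\x{E9}sum\x{E9}";
utf8::downgrade($word);
my $flag_on_text = utf8::is_utf8($word) ? 1 : 0;            # flag is not set

# Neither operation changed the value of the string.
my $binary_unchanged = ($jpeg_magic eq "\xFF\xD8\xFF\xE0") ? 1 : 0;

print "binary: $flag_on_binary, text: $flag_on_text, same value: $binary_unchanged\n";
```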

The whole utf8 pragma is a mystery for me.

Probably because it is overdocumented, and therefore confusing. Most developers can stop reading after the part where it says that its purpose is to enable Unicode literals in the source code.

In the above example utf8::is_utf8($data) will print OK - but I don't understand WHY.

Because uni::perl enables use open qw(:utf8 :std); for you, so any input read from STDIN with <> is decoded. The normalisation step afterwards does not change that.
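To see what a decoding layer does without involving STDIN, here is a sketch that reads the same octets twice from an in-memory filehandle. The explicit :raw and :encoding(UTF-8) layers stand in for what the open pragma would set up; the sample data is an assumption for the demo.

```perl
use strict;
use warnings;

my $octets = "\xC5\xBE\n";    # U+017E encoded as UTF-8, plus a newline

# :raw layer: no decoding, the line comes back as bytes.
open(my $raw_fh, '<:raw', \$octets) or die "open failed: $!";
chomp(my $line_bytes = <$raw_fh>);
close $raw_fh;

# :encoding(UTF-8) layer: the line is decoded into characters while reading.
open(my $txt_fh, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
chomp(my $line_chars = <$txt_fh>);
close $txt_fh;

printf "%d byte(s) vs %d character(s)\n", length $line_bytes, length $line_chars;
```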

answered Oct 19 '22 01:10

daxim

is_utf8 returns information about which internal storage format was used, period.

It's not related to the value of the string (though certain strings can only be stored in one of the two formats).
It's not related to whether the string has been decoded or not.
It's not related to whether the string contains something that has been encoded using UTF-8 or not.
It's not a validity check of any kind.

Now on to your questions.

The whole utf8 pragma is a mystery for me.

use utf8; tells perl your source code is encoded using UTF-8. If you don't tell it so, perl effectively assumes it's iso-8859-1 (as a side-effect of internal mechanisms).
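A short sketch of the difference, assuming the script file itself is saved as UTF-8 (in which "€", U+20AC, occupies three bytes). The pragma is lexically scoped, so both cases fit in one file:

```perl
use strict;
use warnings;

# With the pragma, the literal is read as one character.
my $with_pragma = do { use utf8; "€" };

# Without it, each byte of the literal becomes its own character.
my $without_pragma = do { no utf8; "€" };

printf "with: %d, without: %d\n", length $with_pragma, length $without_pragma;
```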

The functions in the utf8:: namespace are unrelated to the pragma, and they serve a variety of purposes.

utf8::encode and utf8::decode: Useful encoding and decoding functions. Similar to Encode's encode_utf8 and decode_utf8, but they work in-place.
utf8::upgrade and utf8::downgrade: Rarely used, but useful for working around bugs in XS modules. More on this below.
utf8::is_utf8: I don't know why someone would ever use that.
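For illustration, a minimal round trip with the in-place functions. They are available without loading the utf8 pragma:

```perl
use strict;
use warnings;

my $s = "\x{263A}";                  # one character, WHITE SMILING FACE

utf8::encode($s);                    # in place: $s now holds the bytes E2 98 BA
my $encoded_length = length $s;

my $decoded_ok = utf8::decode($s);   # in place again; true if $s held valid UTF-8
my $decoded_length = length $s;

printf "encoded: %d, decoded: %d\n", $encoded_length, $decoded_length;
```

Unlike the Encode functions, these modify their argument rather than returning a new string, so copy the scalar first if the original is still needed.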

HOW can I ensure (test it) that any $other_data contains a valid Unicode string?

What does "valid Unicode string" mean to you? Unicode has different definitions of valid for different circumstances.
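Two possible readings of "valid", sketched as helpers. Both function names are hypothetical, and which check applies (if any) depends on what the data is supposed to be:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Reading 1: the scalar holds octets that are well-formed (strict) UTF-8.
sub is_wellformed_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Reading 2: the scalar holds decoded text, and every element is a Unicode
# scalar value: at most U+10FFFF and not a surrogate.
sub is_unicode_scalar_values {
    my ($chars) = @_;
    for my $cp (map { ord } split //, $chars) {
        return 0 if $cp > 0x10FFFF;
        return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;
    }
    return 1;
}
```

Neither helper can tell whether a given scalar was meant to hold octets or characters; that still has to be tracked by the program.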

for what purpose is the utf8::is_utf8($data)?

Debugging. It peeks at Perl guts.

In the above example utf8::is_utf8($data) will print OK - but I don't understand WHY.

Because NFD happens to have chosen to return a scalar containing a string in the UTF8=1 format.

Perl has two formats for storing strings:

UTF8=0 can store a sequence of 8-bit values.
UTF8=1 can store a sequence of 72-bit values (although practically limited to 32 or 64 bits).

The first format uses less memory and is faster at accessing a specific position in the string, but it's limited in what it can contain. (For example, it can't store every Unicode code point, since code points need up to 21 bits.) Perl can freely switch between the two.

use utf8;
use feature qw( say );

my $d = my $u = "abcdé";
utf8::downgrade($d);  # Switch to using the UTF8=0 format for $d.
utf8::upgrade($u);    # Switch to using the UTF8=1 format for $u.

say utf8::is_utf8($d) ?1:0;   # 0
say utf8::is_utf8($u) ?1:0;   # 1
say $d eq $u          ?1:0;   # 1
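The memory difference can be made visible with bytes::length, which reports the size of the internal buffer. This is a debugging sketch only; the bytes module is not meant for regular string handling.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

my $d = my $u = "abcd\x{E9}";
utf8::downgrade($d);
utf8::upgrade($u);

# The same string as far as Perl code is concerned...
my $chars_d = length $d;
my $chars_u = length $u;

# ...but a different number of bytes in the internal buffer:
# U+00E9 takes one byte in the UTF8=0 format and two in the UTF8=1 format.
my $buffer_d = bytes::length($d);
my $buffer_u = bytes::length($u);

print "chars: $chars_d/$chars_u, buffer bytes: $buffer_d/$buffer_u\n";
```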

One normally doesn't have to worry about this, but there are buggy modules. There are even buggy corners of Perl remaining despite use feature qw( unicode_strings ). One can use utf8::upgrade and utf8::downgrade to change the format of a scalar to the one the XS function expects.
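A sketch of that workaround. xs_wants_bytes is a purely hypothetical stand-in for an XS function that reads the string buffer directly and expects the UTF8=0 format:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a buggy XS function that expects UTF8=0 input.
sub xs_wants_bytes { my ($buf) = @_; return length $buf }

my $payload = "abc\x{E9}";
utf8::upgrade($payload);                           # simulate a string that ended up UTF8=1

# A true second argument makes downgrade return false instead of dying
# when the string cannot be represented in the UTF8=0 format.
my $downgraded = utf8::downgrade($payload, 1);
my $result     = $downgraded ? xs_wants_bytes($payload) : undef;

my $wide            = "abc\x{263A}";
my $wide_downgraded = utf8::downgrade($wide, 1);   # false: U+263A does not fit in 8 bits
```

When the downgrade fails, the string is left as it was, and the caller has to decide what to do (usually encode it explicitly).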

Or is it misnamed, and the function should be called uni::is_unicode($data)?

That's no better. Perl has no way to know whether a string is a Unicode string or not. If you need to track that, you need to track it yourself.

Strings in the UTF8=0 format may contain Unicode code points.

my $s = "abc";  # U+0061,0062,0063

Strings in the UTF8=1 format may contain values that aren't Unicode code points.

my $s = pack('W*', @temperature_measurements);
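A runnable version of that last line, with made-up readings. The string ends up in the UTF8=1 format simply because some values exceed 255, even though none of them is meant as a character:

```perl
use strict;
use warnings;

# Hypothetical readings: plain numbers, not characters.
my @temperature_measurements = (21, 300, 1_500);

my $s = pack('W*', @temperature_measurements);

# 300 and 1_500 do not fit in 8 bits, so Perl has to use the UTF8=1 format.
my $needs_wide_format = utf8::is_utf8($s) ? 1 : 0;

# The numbers come back unchanged.
my @round_trip = unpack('W*', $s);

print "flag: $needs_wide_format, values: @round_trip\n";
```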


answered Oct 19 '22 00:10

ikegami
