
Length of string in Perl independent of character encoding

The length function assumes that Chinese characters are more than one character. How do I determine length of a string in Perl independent of character encoding (treat Chinese characters as one character)?

syker asked Jan 21 '23 08:01



1 Answer

The length function counts characters, not octets (AKA bytes), but what counts as a character depends on whether Perl has decoded the string. A Chinese character is still a single character once the string has been decoded correctly, even though it takes up more than one octet in an encoding such as UTF-8. If the string has not been decoded, Perl treats every octet as its own character and length returns the byte count. So the length of a string in Perl depends on the encoding that Perl thinks the string is in; the only length you can get without knowing the encoding is the simple byte length.
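
To make the character/octet distinction concrete, here is a minimal sketch. It assumes the script itself is saved as UTF-8, and the variable names are only illustrative. It compares the character count of a decoded string with its byte count in two encodings:

```perl
use strict;
use warnings;
use utf8;                      # assumption: this source file is saved as UTF-8
use Encode qw(encode);

my $text = "长城";             # two characters

# Character count: what length() reports for a decoded string.
my $chars = length($text);

# Byte counts: these depend on the encoding you serialize to.
my $utf8_bytes  = length(encode('UTF-8',    $text));
my $utf16_bytes = length(encode('UTF-16LE', $text));

print "chars=$chars utf8=$utf8_bytes utf16le=$utf16_bytes\n";
# prints: chars=2 utf8=6 utf16le=4
```

The character count stays at 2 while the byte count changes with the target encoding.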

Make sure that the string in question has been decoded from UTF-8 into a Perl character string (that is, it is flagged as UTF-8 internally). For example, in a UTF-8 terminal this yields 3:

$ perl -e 'print length("长")'

whereas this yields 1:

$ perl -e 'use utf8; print length("长")'

as does:

$ perl -e 'use Encode; print length(Encode::decode("utf-8", "长"))'

If you're getting your Chinese characters from a file, make sure that you call binmode $fh, ':encoding(UTF-8)' on the filehandle before reading or writing it (binmode $fh, ':utf8' also works, but it does not validate the input); if you're getting your data from a database, make sure the database driver is returning decoded UTF-8 strings (or use Encode to decode them yourself).
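
As a rough sketch of the file case, the following writes a small temporary file (created here only so the example is self-contained) and reads it back twice, once through a decoding layer and once raw. It assumes the script is saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;                      # assumption: this source file is saved as UTF-8
use File::Temp qw(tempfile);

# Hypothetical sample file, written through an encoding layer.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)';
print {$tmp_fh} "长城\n";
close $tmp_fh or die "close failed: $!";

# Read with a decoding layer: the line arrives as characters.
open my $in, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;
my $decoded_len = length($line);

# Read the same file raw: the line arrives as bytes.
open my $raw, '<:raw', $path or die "open failed: $!";
my $bytes = <$raw>;
close $raw;
chomp $bytes;
my $raw_len = length($bytes);

print "decoded=$decoded_len raw=$raw_len\n";
# prints: decoded=2 raw=6
```

Passing the layer as the open mode has the same effect as calling binmode right after opening the handle.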

I don't think you have to have everything in UTF-8; you really only need to ensure that each string is decoded using the encoding it actually arrived in. I'd go with UTF-8 front to back (and even sideways) though, as that's the lingua franca for Unicode and it will make things easier if you use it everywhere.
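
For instance, if the data arrives in GBK rather than UTF-8, decoding it with the matching encoding name should give the same character count. This is a sketch that simulates such input by encoding a literal first; 'cp936' is the name the core Encode::CN module uses for GBK, and the script is assumed to be saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;                      # assumption: this source file is saved as UTF-8
use Encode qw(encode decode);

# Simulate bytes arriving from a GBK source (illustrative only).
my $gbk_bytes = encode('cp936', "长");

# Decode with the encoding the data actually uses.
my $text = decode('cp936', $gbk_bytes);

my $byte_len = length($gbk_bytes);   # octets in the GBK form
my $char_len = length($text);        # characters after decoding

print "bytes=$byte_len chars=$char_len\n";
# prints: bytes=2 chars=1
```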

You might want to spend some time reading the perlunicode man page if you're going to be dealing with non-ASCII data.

mu is too short answered Jan 30 '23 08:01
