The length function assumes that Chinese characters are more than one character. How do I determine length of a string in Perl independent of character encoding (treat Chinese characters as one character)?
The length
function operates on characters, not octets (AKA bytes). The definition of a character depends on the encoding. Chinese characters are still single characters (if the encoding is correctly set!) but they take up more than one octet of space. So, the length of a string in Perl is dependent on the character encoding that Perl thinks the string is in; the only string length that is independent of the character encoding is the simple byte length.
Make sure that the string in question is flagged as UTF-8 and encoded in UTF-8. For example, this yields 3:
$ perl -e 'print length("长")'
whereas this yields 1:
$ perl -e 'use utf8; print length("长")'
as does:
$ perl -e 'use Encode; print length(Encode::decode("utf-8", "长"))'
If you're getting your Chinese characters from a file, make sure that you binmode $fh, ':utf8'
the file before reading or writing it; if you're getting your data from a database, make sure the database is returning strings in UTF-8 format (or use Encode
to do it for you).
I don't think you have to have everything in UTF-8, you really only need to ensure that the string is flagged as having the correct encoding. I'd go with UTF-8 front to back (and even sideways) though as that's the lingua franca for Unicode and it will make things easier if you use it everywhere.
You might want to spend some time reading the perlunicode man page if you're going to be dealing with non-ASCII data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With