Length of string in Perl independent of character encoding

Question

The length function counts each Chinese character as more than one character. How do I determine the length of a string in Perl independent of character encoding (that is, treat each Chinese character as one character)?

mu is too short · Accepted Answer

The length function operates on characters, not octets (AKA bytes). The definition of a character depends on the encoding. Chinese characters are still single characters (if the encoding is correctly set!) but they take up more than one octet of space. So, the length of a string in Perl is dependent on the character encoding that Perl thinks the string is in; the only string length that is independent of the character encoding is the simple byte length.

Make sure that the string in question is encoded in UTF-8 and that Perl has been told to treat it as UTF-8. For example, in a UTF-8 terminal this yields 3, because Perl sees only the three raw octets:

$ perl -e 'print length("长")'

whereas this yields 1:

$ perl -e 'use utf8; print length("长")'

as does:

$ perl -e 'use Encode; print length(Encode::decode("utf-8", "长"))'
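The same comparison can be written as a small script. This is only a sketch: it spells the UTF-8 octets of 长 (U+957F) as hex escapes so the result does not depend on how the file is saved, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets for 长 (U+957F), written as escapes so the result
# does not depend on the encoding of this source file.
my $octets = "\xE9\x95\xBF";

# Decoding turns the octets into a Perl character string.
my $chars = decode('UTF-8', $octets);

my $octet_count     = length($octets);                   # counts octets
my $char_count      = length($chars);                    # counts characters
my $reencoded_count = length(encode('UTF-8', $chars));   # octets again

print "octets: $octet_count, characters: $char_count, re-encoded: $reencoded_count\n";
# prints "octets: 3, characters: 1, re-encoded: 3"
```

Calling length on the encoded form is how you get the byte length of a string that has already been decoded.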

If you're getting your Chinese characters from a file, make sure that you call binmode $fh, ':utf8' on the filehandle before reading from or writing to it (':encoding(UTF-8)' does the same job but also validates the data); if you're getting your data from a database, make sure the database driver is returning decoded UTF-8 strings (or use Encode to decode them yourself).
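A minimal sketch of the file case, assuming a scratch file created with File::Temp purely for illustration. It writes the raw UTF-8 octets of 长, then reads the line back twice: once with a decoding layer set via binmode and once as raw bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw UTF-8 octets for 长 (U+957F) plus a newline to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # raw bytes, no layers
print {$out} "\xE9\x95\xBF\n";
close $out or die "close failed: $!";

# Read it back with a decoding layer: length counts characters.
open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in, ':encoding(UTF-8)';
my $line = <$in>;
close $in;
chomp $line;
my $decoded_length = length($line);

# Read it back as raw bytes: length counts octets.
open my $raw, '<', $path or die "Cannot open $path: $!";
binmode $raw;
my $raw_line = <$raw>;
close $raw;
chomp $raw_line;
my $raw_length = length($raw_line);

print "decoded: $decoded_length, raw: $raw_length\n";
# prints "decoded: 1, raw: 3"
```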

I don't think you have to have everything in UTF-8; you really only need to ensure that each string is decoded using its correct encoding. I'd go with UTF-8 front to back (and even sideways) though, as that's the lingua franca for Unicode and it will make things easier if you use it everywhere.
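To illustrate that the character count does not depend on which encoding the bytes arrived in, here is a hedged sketch that round-trips 长 through both UTF-8 and cp936 (the name Encode uses for GBK). The choice of cp936 is just an example; any encoding Encode supports for the character would do.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $char = "\x{957F}";    # 长 as a single Perl character

# The same character occupies a different number of octets per encoding.
my $utf8_octets = encode('UTF-8', $char);
my $gbk_octets  = encode('cp936', $char);

# Decoding with the matching encoding gives one character either way.
my $from_utf8 = decode('UTF-8', $utf8_octets);
my $from_gbk  = decode('cp936', $gbk_octets);

printf "UTF-8 octets: %d, cp936 octets: %d\n",
    length($utf8_octets), length($gbk_octets);
printf "characters after decoding: %d and %d\n",
    length($from_utf8), length($from_gbk);
# prints "UTF-8 octets: 3, cp936 octets: 2"
# then   "characters after decoding: 1 and 1"
```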

You might want to spend some time reading the perlunicode man page if you're going to be dealing with non-ASCII data.

Tags:

string

character-encoding

unicode

perl

syker
