I'm trying to use libunistring in my C program.
I have to process UTF-8 strings, and for that I used the u8_strlen() function from the libunistring library.
Code example:
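#include <stdio.h>
#include <string.h>
#include <unistr.h>
void print_length(uint8_t *msg) {
    /* Both functions return size_t, so the matching format is %zu, not %d. */
    printf("Default strlen: %zu\n", strlen((char *)msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}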
Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoding).
I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just 6 letters).
But I received curious results:
Default strlen: 12
U8 strlen: 12
After this I looked up the implementation of u8_strlen and found this code:
size_t
u8_strlen (const uint8_t *s)
{
return strlen ((const char *) s);
}
Is this a bug, or is it the correct answer? If it's correct, why?
I believe this is the intended behavior.
The libunistring manual says that:
size_t u8_strlen (const uint8_t *s)
Returns the number of units in s.
Also in the manual, it defines what this "unit" is:
UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).
I believe the reason they label the function u8_strlen, even though it does nothing more than the standard strlen, is that the library also has u16_strlen and u32_strlen for operating on UTF-16 and UTF-32 strings, respectively (these count the number of 2-byte units until 0x0000 and 4-byte units until 0x00000000), and they included u8_strlen simply for completeness.
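To illustrate what "counting units" means for the wider encodings, here is a minimal sketch of a 16-bit unit counter. The name count_units16 is my own and this is not libunistring's actual source; it only mirrors the behavior described above, which is to walk the array until the zero unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not libunistring's code: counts the 16-bit
   units that precede the terminating 0x0000 unit. */
static size_t count_units16(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

For "привет" in UTF-16 this gives 6, because each of those letters fits in one unit. A character outside the Basic Multilingual Plane, such as U+1F600, is stored as a surrogate pair and counts as 2 units, so even here units and characters are not the same thing.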
GNU gnulib does, however, include mbslen, which probably does what you want:
mbslen function: Determine the number of multibyte characters in a string.
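If you would rather not pull in gnulib, counting the letters of a UTF-8 string can be sketched in plain C. This is an illustration only: utf8_count_code_points is a hypothetical name, it assumes the input is already valid UTF-8, and it counts code points, which is not always the same as user-visible characters (for example when combining marks are present).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libunistring or gnulib): counts the
   code points in a NUL-terminated string, assuming valid UTF-8 input. */
static size_t utf8_count_code_points(const uint8_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; s++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
           starts a new code point. */
        if ((*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Called on the 12 bytes of "привет", this returns 6, the value the question expected from u8_strlen. If I recall the manual correctly, libunistring itself also provides u8_mbsnlen(s, n), which counts the Unicode characters in the first n units of s, so u8_mbsnlen(msg, u8_strlen(msg)) may be the more direct route inside that library.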