
Is libunistring's u8_strlen() equivalent to strlen()?

Tags: c, unicode, utf-8

I'm trying to use libunistring in my C program. I have to process UTF-8 strings, and for that I used the u8_strlen() function from the libunistring library.
Code example:

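#include <stdio.h>
#include <string.h>
#include <unistr.h>

void print_length(const uint8_t *msg) {
    printf("Default strlen: %zu\n", strlen((const char *)msg)); /* size_t needs %zu, not %d */
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}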

Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoded). I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just the 6 letters).

But I received curious results:

Default strlen: 12
U8 strlen: 12

After this I tried to look up the u8_strlen implementation and found this code:

size_t
u8_strlen (const uint8_t *s)
{
    return strlen ((const char *) s);
}

I'm wondering, is this a bug or is it the correct result? If it's correct, why?

Artem Agasiev asked Sep 26 '13 16:09


1 Answer

I believe this is the intended behavior.

The libunistring manual says that:

size_t u8_strlen (const uint8_t *s)

Returns the number of units in s.

Also in the manual, it defines what this "unit" is:

UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).

I believe the function is named u8_strlen, even though it does nothing more than the standard strlen, because the library also has u16_strlen and u32_strlen for UTF-16 and UTF-32 strings, respectively. Those count 2-byte units up to a 0x0000 terminator and 4-byte units up to a 0x00000000 terminator, so u8_strlen was most likely included simply for completeness.
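To make "number of units" concrete, here is a minimal sketch of what counting units means for each encoding. The count_units8, count_units16 and count_units32 helpers are hypothetical names written for illustration; they are not libunistring's source, only an approximation of the documented behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not libunistring code: each one counts code units
   up to, but not including, the all-zero terminator unit. */

/* UTF-8: one unit is one byte, so this is equivalent to strlen(). */
size_t count_units8(const uint8_t *s) {
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* UTF-16: one unit is 2 bytes, the terminator is 0x0000. */
size_t count_units16(const uint16_t *s) {
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* UTF-32: one unit is 4 bytes, the terminator is 0x00000000. */
size_t count_units32(const uint32_t *s) {
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

For "привет", the UTF-8 form has 12 units (each letter takes two bytes), while the UTF-16 and UTF-32 forms have 6 units each. None of these functions counts characters; the unit count only happens to equal the letter count when every character fits in a single unit.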

GNU gnulib does, however, include mbslen, which probably does what you want:

mbslen function: Determine the number of multibyte characters in a string.
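If pulling in gnulib is not an option, counting characters in a UTF-8 string can also be done by hand. The sketch below uses a hypothetical helper, utf8_count_codepoints, which is not part of libunistring or gnulib. It assumes the input is valid, NUL-terminated UTF-8 and counts every byte that is not a continuation byte (bit pattern 10xxxxxx). Unlike mbslen, it does not depend on the current locale. I believe libunistring also documents u8_mbsnlen for counting characters in a given number of units, but check the manual for your version.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8 string.
   Every code point starts with exactly one byte that is NOT of the form
   10xxxxxx, so counting those lead bytes gives the code point count.
   Assumes valid UTF-8; malformed input is not detected. */
size_t utf8_count_codepoints(const uint8_t *s) {
    size_t count = 0;
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80)   /* skip continuation bytes */
            count++;
    }
    return count;
}
```

Called with "привет" in UTF-8, this returns 6 where strlen() and u8_strlen() return 12. Note that it counts code points, so a letter followed by a combining mark still counts as two.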

Berry answered Sep 20 '22 23:09
