Is libunistring's u8_strlen() equivalent to strlen()?

Tags: c, unicode, utf-8

I'm trying to use libunistring in my C program. I need to process UTF-8 strings, so I used the u8_strlen() function from the libunistring library.
Code example:

void print_length(uint8_t *msg) {
    /* Both functions return size_t, so print with %zu rather than %d. */
    printf("Default strlen: %zu\n", strlen((char *)msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}

Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoded). I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just 6 letters).

But I received curious results:

Default strlen: 12
U8 strlen: 12

After this, I looked up the implementation of u8_strlen and found this code:

size_t
u8_strlen (const uint8_t *s)
{
    return strlen ((const char *) s);
}

Is this a bug, or is it the correct result? If it is correct, why?

asked Sep 26 '13 16:09

Artem Agasiev

1 Answer

I believe this is the intended behavior.

The libunistring manual says that:

size_t u8_strlen (const uint8_t *s)

Returns the number of units in s.

The manual also defines what a "unit" is:

UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).

I believe the function is named u8_strlen, even though it does nothing more than the standard strlen, because the library also has u16_strlen and u32_strlen for UTF-16 and UTF-32 strings, respectively. Those count 2-byte units up to a terminating 0x0000 and 4-byte units up to a terminating 0x00000000, so u8_strlen was presumably included simply for completeness.

GNU gnulib does, however, include mbslen, which probably does what you want:

mbslen function: Determine the number of multibyte characters in a string.

answered Sep 20 '22 23:09

Berry
