
UTF-8 aware strncpy

Tags: c++, c, utf-8, strncpy

I find it hard to believe I'm the first person to run into this problem, but I searched for quite some time and didn't find a solution.

I'd like to use strncpy but have it be UTF-8 aware, so that it never writes only part of a UTF-8 character into the destination string.

Otherwise, whenever the source string is longer than the maximum length, you can never be sure that the resulting string is valid UTF-8, even if you know the source is.
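
To make the failure concrete, here is a minimal sketch of how a plain byte-limited copy can cut a character in half. The helper name `strncpy_splits_char` is hypothetical and exists only for illustration; it assumes `src` is valid, null-terminated UTF-8 and that `dst` has room for `limit` bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most `limit` bytes with plain strncpy and
   report whether the cut landed inside a multi-byte UTF-8 sequence.
   Returns 1 if a character was split, 0 otherwise. */
int strncpy_splits_char(char *dst, const char *src, size_t limit)
{
    /* strncpy knows nothing about UTF-8; it also leaves dst without a
       terminator when src is longer than limit. */
    strncpy(dst, src, limit);

    if (strlen(src) <= limit)
        return 0; /* the whole string fit, nothing was cut */

    /* src[limit] is the first byte left out. If it is a continuation
       byte (bit pattern 10xxxxxx), the copy stopped mid-character. */
    return ((unsigned char)src[limit] & 0xC0) == 0x80;
}
```

For example, "café" encoded as UTF-8 is the five bytes `c a f 0xC3 0xA9`. With a limit of 4, the destination ends in a lone `0xC3` lead byte, which is not valid UTF-8; with a limit of 3, the cut falls on a character boundary.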

Validating the resulting string afterwards can work, but if the copy is called a lot it would be better to have a strncpy-like function that handles this itself.

GLib has g_utf8_strncpy, but it copies a given number of Unicode characters, whereas I'm looking for a copy function that limits by byte length.

To be clear, by "UTF-8 aware" I mean that it must not exceed the size of the destination buffer and must never copy only part of a UTF-8 character. (Given valid UTF-8 input, it must never produce invalid UTF-8 output.)


Note:

Some replies have pointed out that strncpy pads the rest of the destination with null bytes and that it won't ensure null termination. In retrospect I should have asked for a UTF-8 aware strlcpy; however, at the time I didn't know this function existed.

ideasman42 asked Sep 08 '11 07:09



1 Answer

I've tested this on many sample UTF-8 strings with multi-byte characters. If the source is too long, it searches backward from the null terminator, dropping one character at a time, until what remains ends on the last complete UTF-8 character that fits into the destination buffer. As long as sizeDest is nonzero, it always null-terminates the destination.

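#include <string.h> // strlen, memcpy

// Copies src into dst without splitting a UTF-8 character.
// sizeDest is the size of dst in bytes, including room for the null terminator.
char* utf8cpy(char* dst, const char* src, size_t sizeDest )
{
    if( sizeDest ){
        size_t sizeSrc = strlen(src); // number of bytes not including null
        while( sizeSrc >= sizeDest ){ // Too long: drop the last character and check again.

            const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
            while( lastByte > src ) // Never step before the start of src, even on malformed input.
                if((*--lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character.
                    break;

            sizeSrc = lastByte - src;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}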
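
The function above measures the whole source with strlen and then trims from the end. As a possible alternative (a sketch, not part of the original answer), the scan can stop at the destination size and back up from there, so the work is bounded by the buffer rather than by the source length. The name `utf8cpy_bounded` is hypothetical, and the sketch assumes `src` is valid, null-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variant: copies at most dst_size - 1 bytes of src into dst,
   never splits a UTF-8 sequence, and null-terminates whenever dst_size > 0.
   Assumes src is valid, null-terminated UTF-8. */
char *utf8cpy_bounded(char *dst, const char *src, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0)
        return dst; /* no room even for the terminator */

    /* Measure src, but never look past the bytes that could fit. */
    while (n < dst_size - 1 && src[n] != '\0')
        n++;

    /* If src continues, src[n] is the first byte left out. Back up while
       it is a continuation byte (10xxxxxx) so the cut lands on a
       character boundary. */
    if (src[n] != '\0')
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;

    memcpy(dst, src, n);
    dst[n] = '\0';
    return dst;
}
```

With the five-byte UTF-8 string "café" and a 5-byte buffer, this copies "caf" plus the terminator; with a 6-byte buffer the whole string fits. The cast to unsigned char keeps the bit test independent of whether plain char is signed on the target platform.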
Big Al answered Sep 30 '22 01:09
