utf8 aware strncpy

Question

I find it hard to believe I'm the first person to run into this problem, but I searched for quite some time and didn't find a solution.

I'd like to use strncpy but have it be UTF-8 aware, so that it never writes part of a UTF-8 character into the destination string.

Otherwise you can never be sure that the resulting string is valid UTF-8, even if you know the source is: when the source string is longer than the maximum length, the cut can land in the middle of a multi-byte character.

Validating the resulting string afterwards would work, but if this is going to be called a lot it would be better to have a strncpy-like function that handles it during the copy.
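For reference, the validate-afterwards approach could look roughly like the sketch below. The name utf8_is_complete is hypothetical (it is not from glib or any other library), and the check is structural only: it verifies lead and continuation bytes but does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first len bytes of s consist only of
   complete UTF-8 sequences, 0 otherwise. Structural check only. */
int utf8_is_complete(const char *s, size_t len)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < len) {
        unsigned char c = (unsigned char)s[i];
        size_t extra;                               /* continuation bytes expected */
        if (c < 0x80)                extra = 0;     /* 0xxxxxxx: ASCII */
        else if ((c & 0xE0) == 0xC0) extra = 1;     /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) extra = 2;     /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) extra = 3;     /* 11110xxx */
        else return 0;                              /* stray continuation or invalid lead byte */
        if (extra > len - i - 1)
            return 0;                               /* sequence cut off by the end of the buffer */
        for (size_t k = 1; k <= extra; k++)
            if (((unsigned char)s[i + k] & 0xC0) != 0x80)
                return 0;                           /* expected a 10xxxxxx byte */
        i += extra + 1;
    }
    return 1;
}
```

This costs a second pass over every copied string, which is the overhead I would like to avoid.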

glib has g_utf8_strncpy, but it copies a given number of Unicode characters, whereas I'm looking for a copy function that limits by byte length.

To be clear, by "UTF-8 aware" I mean that it must not exceed the size of the destination buffer and must never copy only part of a UTF-8 character. (Given valid UTF-8 input, it must never produce invalid UTF-8 output.)
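To make the requirement concrete, here is a minimal sketch of the byte-limited boundary search, written as a forward scan. The name utf8_prefix_len is hypothetical, and the code assumes src is valid, null-terminated UTF-8; it makes no attempt to cope with malformed input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the longest prefix of src that is
   at most max_bytes long and ends on a character boundary.
   Assumes src is valid, null-terminated UTF-8. */
size_t utf8_prefix_len(const char *src, size_t max_bytes)
{
    assert(src != NULL);
    size_t len = 0;
    while (src[len] != '\0') {
        unsigned char c = (unsigned char)src[len];
        size_t width = 1;                           /* 0xxxxxxx: ASCII */
        if ((c & 0xE0) == 0xC0)      width = 2;     /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) width = 3;     /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) width = 4;     /* 11110xxx */
        if (width > max_bytes - len)
            break;                                  /* the next character would not fit */
        len += width;
    }
    return len;
}
```

With a destination of dst_size bytes (dst_size > 0), the copy would then be n = utf8_prefix_len(src, dst_size - 1), followed by memcpy(dst, src, n) and dst[n] = '\0'.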

Note:

Some replies have pointed out that strncpy pads the rest of the destination with null bytes and does not guarantee null termination. In retrospect I should have asked for a UTF-8 aware strlcpy, but at the time I didn't know that function existed.
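A strlcpy-flavoured version might look something like the sketch below. The name utf8_strlcpy is hypothetical, and the return value simply follows the strlcpy convention of returning strlen(src), so a result greater than or equal to size means the copy was truncated. It assumes src is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strlcpy-style variant: copies at most size - 1 bytes, never
   splits a character, terminates dst whenever size > 0, and returns
   strlen(src). Assumes src is valid UTF-8. */
size_t utf8_strlcpy(char *dst, const char *src, size_t size)
{
    assert(dst != NULL && src != NULL);
    size_t src_len = strlen(src);
    if (size == 0)
        return src_len;                 /* nothing can be written */

    size_t n = src_len;
    if (n >= size) {
        n = size - 1;
        /* If the first byte that would be dropped is a continuation byte
           (10xxxxxx), the cut falls inside a character: back up to the
           lead byte of that character. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return src_len;
}
```

Unlike strncpy, this does not zero-fill the unused tail of dst; it writes a single terminator.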

Big Al · Accepted Answer

I've tested this on many sample UTF-8 strings with multi-byte characters. If the source is too long, it searches the source backward (starting at the null terminator) to find the last complete UTF-8 character that fits in the destination buffer. It always null-terminates the destination, provided the buffer size is non-zero.

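#include <stddef.h> // size_t
#include <string.h> // strlen, memcpy

// Copies src into dst (capacity sizeDest bytes, including the terminator)
// without ever splitting a multi-byte UTF-8 character.
char* utf8cpy(char* dst, const char* src, size_t sizeDest )
{
    if( sizeDest ){
        size_t sizeSrc = strlen(src); // number of bytes not including null
        while( sizeSrc >= sizeDest ){
            // Too long: drop the last character by stepping back over its
            // continuation bytes (10xxxxxx) to its initial byte. The
            // sizeSrc > 0 test stops the scan at the start of src, even if
            // src is malformed and begins with continuation bytes.
            while( --sizeSrc > 0 && ((unsigned char)src[sizeSrc] & 0xC0) == 0x80 )
                ;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}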
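As a quick illustration of the truncation behaviour, here is a small self-check. The utf8cpy in it is a compact restatement of the backward scan, repeated only so the example compiles on its own; utf8cpy_selfcheck is a hypothetical name, and the sample string is "日本語" (3 characters, 9 bytes) written as byte escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compact restatement of the backward scan, repeated for self-containment. */
char *utf8cpy(char *dst, const char *src, size_t sizeDest)
{
    if (sizeDest) {
        size_t sizeSrc = strlen(src);
        while (sizeSrc >= sizeDest) {
            /* Step back to the initial byte of the last character. */
            while (--sizeSrc > 0 && ((unsigned char)src[sizeSrc] & 0xC0) == 0x80)
                ;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}

/* Hypothetical self-check covering the three interesting cases. */
void utf8cpy_selfcheck(void)
{
    const char *jp = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
    char big[16], mid[8], small[3];

    utf8cpy(big, jp, sizeof big);       /* fits: copied whole */
    assert(strcmp(big, jp) == 0);

    utf8cpy(mid, jp, sizeof mid);       /* 7 usable bytes: only 2 characters fit */
    assert(strcmp(mid, "\xE6\x97\xA5\xE6\x9C\xAC") == 0);

    utf8cpy(small, jp, sizeof small);   /* 2 usable bytes: no whole character fits */
    assert(small[0] == '\0');
}
```

Note that when not even one whole character fits, the result is an empty string rather than a partial character.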

Tags:

c++

c

utf-8

strncpy

Note:

ideasman42
