Is there a C library to convert Unicode code points to UTF-8?

I have to go through some text and write UTF-8 output according to the character patterns. I thought it would be easy if I could work with the code points and convert them to UTF-8. I have been reading about Unicode and UTF-8, but couldn't find a good solution. Any help will be appreciated.

977

asked Jan 05 '11 17:01

chanux

2 Answers

Converting Unicode code points to UTF-8 is so trivial that making the call to a library probably takes more code than just doing it yourself:

if (c<0x80) *b++=c;
else if (c<0x800) *b++=192+c/64, *b++=128+c%64;
else if (c-0xd800u<0x800) goto error;
else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;
else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;
else goto error;

Also, doing it yourself means you can tune the API to the kind of work you need (one character at a time, or long strings?). You can remove the error cases if you know your input is always a valid Unicode scalar value.
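As a rough sketch of the character-at-a-time variant, the snippet above can be wrapped in a function. The name `utf8_encode`, the return convention (bytes written, or 0 on error), and the use of shifts and masks instead of division are assumptions made for illustration, not part of the original answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point into buf, which must have
 * room for at least 4 bytes. Returns the number of bytes written, or 0
 * if c is a surrogate (U+D800..U+DFFF) or is above U+10FFFF. */
size_t utf8_encode(uint32_t c, unsigned char *buf)
{
    unsigned char *b = buf;

    if (c < 0x80) {
        *b++ = (unsigned char)c;
    } else if (c < 0x800) {
        *b++ = (unsigned char)(0xC0 | (c >> 6));
        *b++ = (unsigned char)(0x80 | (c & 0x3F));
    } else if (c - 0xD800u < 0x800) {
        return 0;                       /* surrogate: not a scalar value */
    } else if (c < 0x10000) {
        *b++ = (unsigned char)(0xE0 | (c >> 12));
        *b++ = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        *b++ = (unsigned char)(0x80 | (c & 0x3F));
    } else if (c < 0x110000) {
        *b++ = (unsigned char)(0xF0 | (c >> 18));
        *b++ = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        *b++ = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        *b++ = (unsigned char)(0x80 | (c & 0x3F));
    } else {
        return 0;                       /* beyond the Unicode range */
    }
    return (size_t)(b - buf);
}
```

A string-at-a-time version could call this in a loop and advance the output pointer by the returned length, stopping when it returns 0.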

The other direction is a good bit harder to get correct. I recommend a finite automaton approach rather than the typical bit-arithmetic loops that sometimes decode invalid sequences as aliases for real characters (which is very dangerous and can lead to security problems).
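To make the decoding concern concrete, here is a minimal sketch of a strict decoder written as a small state machine. The state is the number of continuation bytes still expected plus the range allowed for the next byte. The name `utf8_decode` and its return convention are assumptions for illustration, and this is not necessarily the table-driven automaton the answer has in mind. It rejects overlong forms such as `C0 80`, encoded surrogates, and values above U+10FFFF rather than decoding them as aliases.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the first bytes of
 * s[0..n-1]. Returns the number of bytes consumed, or 0 if the input is
 * empty, truncated, or not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp = 0;
    unsigned needed = 0;                      /* continuation bytes left */
    unsigned char lower = 0x80, upper = 0xBF; /* range for the next byte */
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char b = s[i];
        if (i == 0) {
            if (b <= 0x7F) { *out = b; return 1; }
            if (b >= 0xC2 && b <= 0xDF) {
                needed = 1; cp = b & 0x1Fu;   /* 0xC0, 0xC1 are overlong */
            } else if (b >= 0xE0 && b <= 0xEF) {
                if (b == 0xE0) lower = 0xA0;  /* excludes overlong forms */
                if (b == 0xED) upper = 0x9F;  /* excludes surrogates */
                needed = 2; cp = b & 0x0Fu;
            } else if (b >= 0xF0 && b <= 0xF4) {
                if (b == 0xF0) lower = 0x90;  /* excludes overlong forms */
                if (b == 0xF4) upper = 0x8F;  /* excludes > U+10FFFF */
                needed = 3; cp = b & 0x07u;
            } else {
                return 0;                     /* invalid lead byte */
            }
        } else {
            if (b < lower || b > upper) return 0;
            lower = 0x80; upper = 0xBF;       /* later bytes: full range */
            cp = (cp << 6) | (b & 0x3Fu);
            if (--needed == 0) { *out = cp; return i + 1; }
        }
    }
    return 0;                                 /* empty or truncated */
}
```

A decoder that only masks and shifts the payload bits would accept `C0 80` as U+0000, which is the kind of aliasing the paragraph above warns about.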

Even if you do end up going with a library, I think you should either try writing it yourself first or at least seriously study the UTF-8 specification before going further. A lot of bad design comes from treating UTF-8 as a black box, when the whole point is that it is not one: it was designed to have very powerful properties. Too many programmers new to UTF-8 fail to see this until they have worked with it a lot themselves.

101

answered Sep 19 '22 20:09

R.. GitHub STOP HELPING ICE

iconv could be used, I figure.

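#include <iconv.h>
#include <wchar.h>

iconv_t cd;
char out[7];
wchar_t in = CODE_POINT_VALUE;
/* iconv takes pointers to pointers and advances them as it converts */
char *inptr = (char *)&in, *outptr = out;
size_t inlen = sizeof(in), outlen = sizeof(out);

cd = iconv_open("utf-8", "wchar_t");
if (cd != (iconv_t)-1) {
    iconv(cd, &inptr, &inlen, &outptr, &outlen);  /* (size_t)-1 on failure */
    iconv_close(cd);
}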

But I fear that wchar_t might not represent Unicode code points, but arbitrary values. EDIT: I guess you can do it by simply using a Unicode source:

uint16_t in = UNICODE_POINT_VALUE;
cd = iconv_open("utf-8", "ucs-2");
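One caveat with a `uint16_t` and `"ucs-2"` source is that UCS-2 only covers code points up to U+FFFF, and the byte order assumed for unmarked input may differ between iconv implementations. A hedged sketch that sidesteps both issues feeds iconv explicitly big-endian UTF-32. The helper name `utf8_via_iconv` is hypothetical, and the sketch assumes the local iconv accepts the encoding names `"UTF-8"` and `"UTF-32BE"`.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts one code point to UTF-8 through iconv.
 * Returns the number of bytes written to out, or 0 on failure. */
size_t utf8_via_iconv(uint32_t cp, char *out, size_t outsize)
{
    char in[4];
    char *inptr = in, *outptr = out;
    size_t inleft = sizeof in, outleft = outsize;
    size_t rc;
    iconv_t cd;

    /* Serialize as big-endian UTF-32 so the result does not depend on
     * the host byte order. */
    in[0] = (char)((cp >> 24) & 0xFF);
    in[1] = (char)((cp >> 16) & 0xFF);
    in[2] = (char)((cp >> 8) & 0xFF);
    in[3] = (char)(cp & 0xFF);

    cd = iconv_open("UTF-8", "UTF-32BE");   /* to-encoding comes first */
    if (cd == (iconv_t)-1)
        return 0;

    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return 0;

    return outsize - outleft;
}
```

On some platforms iconv lives in a separate library, so linking may need `-liconv`.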

answered Sep 18 '22 20:09

user562374

Tags: c, unicode, utf-8
