Convert ASCII string to Unicode? Windows, pure C

I've found answers to this question for many programming languages, but not for plain C using the Windows API. No C++ answers, please. Consider the following:

#include <windows.h>
#include <string.h>  /* for strlen */

/* inside a function, since the array length isn't a compile-time constant: */
char *string = "The quick brown fox jumps over the lazy dog";
WCHAR unistring[strlen(string)+1];

What function can I use to fill unistring with the characters from string?

asked Jul 20 '12 by user1540336



2 Answers

Use MultiByteToWideChar:

#include <windows.h>
#include <string.h>

/* at block scope, since unistring is a VLA */
char *string = "The quick brown fox jumps over the lazy dog";
size_t len = strlen(string);
WCHAR unistring[len + 1];
/* -1 converts the terminator too; result is 0 on failure */
int result = MultiByteToWideChar(CP_OEMCP, 0, string, -1, unistring, (int)(len + 1));
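Note that CP_OEMCP interprets the source in the OEM code page; for ANSI strings CP_ACP, and for UTF-8 input CP_UTF8, are usually what you want. When you don't know the converted length up front, the usual pattern is to call the function twice: once with a zero-sized buffer to ask for the required length, then again to convert. A minimal sketch, assuming UTF-8 input; the helper name to_wide is made up:

#include <windows.h>
#include <stdlib.h>

/* hypothetical helper: returns a malloc'd wide copy of utf8, or NULL */
WCHAR *to_wide(const char *utf8)
{
    /* first call: cchWideChar == 0 asks for the required size,
       terminator included because cbMultiByte is -1 */
    int needed = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (needed == 0) return NULL;
    WCHAR *wide = malloc(needed * sizeof *wide);
    if (wide == NULL) return NULL;
    /* second call: the actual conversion */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, needed) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
}

The caller releases the result with free().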
answered Sep 29 '22 by Rup


If you are really serious about Unicode, look at International Components for Unicode (ICU), a cross-platform library for handling Unicode conversion and storage from either C or C++.
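A minimal sketch of the same conversion with ICU's C API, assuming a UTF-8 source and, for brevity, a fixed-size output buffer (link against the icuuc library):

#include <unicode/ustring.h>
#include <unicode/utypes.h>

int main(void)
{
    const char *utf8 = "The quick brown fox jumps over the lazy dog";
    UChar buf[128];            /* UChar is ICU's UTF-16 code unit */
    int32_t written = 0;
    UErrorCode status = U_ZERO_ERROR;
    /* u_strFromUTF8 converts UTF-8 to UTF-16; -1 = null-terminated source */
    u_strFromUTF8(buf, 128, &written, utf8, -1, &status);
    return U_FAILURE(status) ? 1 : 0;
}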

Your WCHAR, for example, is not really "Unicode" to begin with: Microsoft defined wchar_t as 16 bits (UCS-2) early on, and was stuck in backward-compatibility hell when the Unicode code space grew beyond 16 bits. UCS-2 is almost, but not quite, identical to UTF-16; the latter is in fact a variable-width encoding (using surrogate pairs), just as UTF-8 is. The only fixed-width Unicode encoding is UTF-32, and even there you don't have a 1:1 relationship between code points (i.e. 32-bit values) and abstract characters (i.e. what ends up as a printable glyph), thanks to combining characters.
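To see the UTF-16 point concretely: a code point outside the Basic Multilingual Plane, such as U+1F600, takes two WCHARs (a surrogate pair), so counting code units is not counting characters. A small demonstration, assuming a Windows toolchain where wchar_t is 16 bits; the pair is hand-encoded:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* U+1F600 in UTF-16 is the surrogate pair D83D DE00:
       two 16-bit code units encoding a single code point */
    wchar_t emoji[] = { 0xD83D, 0xDE00, 0 };
    printf("%u code units for 1 code point\n", (unsigned)wcslen(emoji));
    return 0;
}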

Gratuitous, loosely related list of links:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
  • The UTF-8 Everywhere Manifesto
  • Commonly confused characters by Greg Baker
answered Sep 29 '22 by DevSolar