Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trouble comparing UTF-8 characters using wchar.h

Tags:

c

utf-8

widechar

I am in the process of making a small program that reads a file, that contains UTF-8 elements, char by char. After reading a char it compares it with a few other characters and if there is a match it replaces the character in the file with an underscore '_'.

(Well, it actually makes a duplicate of that file with specific letters replaced by underscores.)

I'm not sure where exactly I'm messing up here but it's most likely everywhere.

Here is my code:

   FILE *fpi;
   FILE *fpo;
   char ifilename[FILENAME_MAX];
   char ofilename[FILENAME_MAX];
   wint_t sample;


   fpi = fopen(ifilename, "rb");
   fpo = fopen(ofilename, "wb");

   while (!feof(fpi)) {
     fread(&sample, sizeof(wchar_t*), 1, fpi);

     if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0)  ) {
   fwrite(L"_", sizeof(wchar_t*), 1, fpo);

     } else {
       fwrite(&sample, sizeof(wchar_t*), 1, fpo);

     }
   } 

I have omitted the code that has to do with the filename generation because it has nothing to offer to the case. It is just string manipulation.

If I feed this program a file containing the words γειά σου κόσμε. I would want it to return this: γει_ σου κόσμ_.

Searching the internet didn't help much as most results were very general or talking about completely different things regarding UTF-8. It's like nobody needs to manipulate single characters for some reason.

Anything pointing me the right way is most welcome. I am not, necessarily, looking for a straightforward fixed version of the code I submitted, I would be grateful for any insightful comments helping me understand how exactly the wchar mechanism works. The whole wbyte, wchar, L, no-L, thing is a mess to me.

Thank you in advance for your help.

like image 533
Eternal_Light Avatar asked Jan 15 '23 14:01

Eternal_Light


1 Answers

C has two different kinds of characters: multibyte characters and wide characters.

Multibyte characters can take a varying number of bytes. For instance, in UTF-8 (which is a variable-length encoding of Unicode), a takes 1 byte, while α takes 2 bytes.

Wide characters always take the same number of bytes. Additionally, a wchar_t must be able to hold any single character from the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, some platforms made wchar_t 16 bits wide: such platforms cannot correctly support characters beyond the BMP using wchar_t. If __STDC_ISO_10646__ is defined, wchar_t holds Unicode code-points, so must be (at least) 4 bytes long (technically, it must be at least 21-bits long).

So, when using UTF-8, you should use multibyte characters, which are stored in normal char variables (but beware of strlen(), which counts bytes, not multibyte characters).

Unfortunately, there is more to Unicode than this.

ά can be represented as a single Unicode codepoint, or as two separate codepoints:

  • U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 char's.
  • U+03B1 GREEK SMALL LETTER ALPHA U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 char's.
  • U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 char's.

All of the above are canonical equivalents, which means that they should be treated as equal for all purposes. So, you should normalize your strings on input/output, using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).

like image 79
ninjalj Avatar answered Jan 27 '23 21:01

ninjalj