Is it actually possible to store and process individual UTF-8 characters in C? If so, how?

Tags: c, unicode, wchar

I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.

I'm having massive problems saving and performing functions on individual characters. My editor and console are both set up to UTF-8 and can display Arabic text fine if I save it as a char*, but when I try to print wchars they display random punctuation marks.

My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], h as syllable[1]segment[1]letter[2] etc. I want to be able to do the same for non-ASCII characters.

I've spent basically the whole day researching unicode and trying out different methods and I can't get any of them to let me store an Arabic character as a character.

I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...

I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but unicode is completely instrumental to my work so I want to work out how to do it from the beginning.

My understanding of how unicode works (in case that's where I'm going wrong):

  1. I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب with the 2 byte sequence 0xd8 0xa8 which indicates the code point U+0628.

  2. I compile it, breaking down 0xd8 0xa8 into the binary 11011000 10101000.

  3. I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word. As the character is alone it will show me the standalone version ب
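
The byte arithmetic in step 1 can be checked with a short sketch. The helper name utf8_encode below is hypothetical (it is not part of the C standard library); it packs a code point into 1 to 4 UTF-8 bytes using the standard bit layout, so U+0628 should come out as 0xD8 0xA8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard function): encode one Unicode code
   point as UTF-8. Writes 1 to 4 bytes plus a terminating '\0' into out,
   which must hold at least 5 bytes. Returns the number of bytes written
   (not counting the terminator), or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, char out[5])
{
    unsigned char *o = (unsigned char *)out;
    size_t n;

    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        o[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
        o[0] = (unsigned char)(0xC0 | (cp >> 6));
        o[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {   /* surrogates are not characters */
            o[0] = 0;
            return 0;
        }
        o[0] = (unsigned char)(0xE0 | (cp >> 12));
        o[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {          /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        o[0] = (unsigned char)(0xF0 | (cp >> 18));
        o[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        o[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {                              /* beyond the Unicode range */
        o[0] = 0;
        return 0;
    }
    o[n] = 0;
    return n;
}
```

Calling utf8_encode(0x0628, buf) with a 5-byte buf fills it with the two bytes 0xD8 0xA8 and a terminator, so printf("%s", buf) shows ب on a console that is set to UTF-8.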

My understanding of the ways I can process unicode in C:

Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)

Use single bytes encoded as UTF-8. Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard code a unicode character enter it as an array in the format:

    const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";

My problems with this:

  1. I need to manipulate individual characters
  2. Having to type Arabic characters as code points is going to render my code completely unreadable and slow me down immensely.
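
On the first problem, individual characters can still be located inside a plain char array, because every byte that continues a multi-byte character has the bit pattern 10xxxxxx. A minimal sketch, assuming the input is valid UTF-8 (utf8_count is a hypothetical name, not a library function):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the Unicode characters (code points) in a
   NUL-terminated UTF-8 string by counting every byte that is NOT a
   continuation byte (continuation bytes look like 10xxxxxx). */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;    /* this byte starts a new character */
    }
    return n;
}
```

For the six-byte sample string "\xe4\xb8\xad\xe6\x96\x87" this returns 2, while strlen would report 6. On the second problem, if the source file is saved as UTF-8 and the compiler passes the bytes through unchanged, the characters can usually be typed directly inside a normal string literal, as the char* example in the code at the end does.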

Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)

Swap using chars for wchars, which hold 2 to 4 bytes depending on the compiler. String functions like strlen will not work as they are expecting characters to be one byte, but there are w functions like wprintf I can use instead.

My problem with this:

I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.

I've tried inputting the unicode code point as well as the actual Arabic character and I've tried printing them both to the console and to a UTF-8 encoded text file and I get the same result, even though both the console and the text file display Arabic text if entered as a char*. I've included my code at the end.

(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things is really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)

Option C - Use external libraries

I've read in various comments that external libraries are the way to go so I've tried:

C programming library

http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.

My problem:

While I can set the character to be an unsigned long integer I can’t print it out, because the printf and wprintf functions don’t work, and neither does the library provided on the website (I think maybe the library was designed for Linux? Some of the datatypes are invalid and amending them didn't work either)

ICU library

My problem:

I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the characterIterator is not available for use in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.

My code

    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>
    #include <locale.h>
    #include <string.h>


    int main ()
    {
        wchar_t unicode = L'\xd8ac';
        wchar_t arabic = L'ب';
        wchar_t number = 0x062c;


        FILE* f;
        f = fopen("unitest.txt","w");
        char* string = "ايه الاخبار";


        //printf - works

        printf("printf - literal arabic character is \"م\"\n");
        fprintf(f,"printf - literal arabic character is \"م\"\n");

        printf("printf - char* string is \"%s\"\n",string);
        fprintf(f,"printf - char* string is \"%s\"\n",string);


        //wprintf  - english - works

        wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
        fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');

        //wprintf - arabic - doesnt work

        wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
        fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);

        wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
        fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);

        wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
        fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);


        wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
        fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');


        wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
        fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");

        fclose(f);

        return 0;
    }

Output file

printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"

wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""

I'm using Windows 10, Notepad++ and MinGW.

Edit: This got marked as a duplicate of Light C Unicode Library but I don't think it really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in the library, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...

I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint32_t or something instead of wchar_t - but how do I then print those datatypes? Can I do it with wprintf?!
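
For the uint32_t route mentioned above, one possible shape is a small decoder that reads a single character from a UTF-8 string and returns its code point. This is only a sketch under stated assumptions: utf8_decode is a hypothetical name, and the function checks the lead and continuation bytes but does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 character that starts at s into
   *cp. Returns the number of bytes it occupies (1 to 4), or 0 if the
   bytes do not form a well-shaped sequence. The caller should check for
   the end of the string first: a '\0' byte decodes as code point 0 with
   length 1. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *u = (const unsigned char *)s;
    uint32_t c;
    size_t len, i;

    assert(s != NULL && cp != NULL);

    if (u[0] < 0x80)                { c = u[0];        len = 1; }
    else if ((u[0] & 0xE0) == 0xC0) { c = u[0] & 0x1F; len = 2; }
    else if ((u[0] & 0xF0) == 0xE0) { c = u[0] & 0x0F; len = 3; }
    else if ((u[0] & 0xF8) == 0xF0) { c = u[0] & 0x07; len = 4; }
    else return 0;   /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; i++) {
        /* Each following byte must look like 10xxxxxx; this test also
           stops safely at a premature '\0'. */
        if ((u[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (u[i] & 0x3F);
    }
    *cp = c;
    return len;
}
```

The uint32_t value is convenient for comparisons such as cp == 0x0628. For printing, one option is to keep a pointer to the original bytes and write them back out with printf("%.*s", (int)len, s), since a console set to UTF-8 expects the encoded bytes rather than a 32-bit number.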

Asked by sally2000, Jun 06 '17 19:06



1 Answer

C and UTF-8 are still getting to know each other. In other words, IMO, C support for UTF-8 is scant.

Is it ... possible to store and process individual UTF-8 characters ...?

The first step is to make certain "ايه الاخبار" is a UTF-8 encoded string. C11 and later support this explicitly with the u8 prefix: u8"ايه الاخبار".

A UTF-8 string is a sequence of char. Each Unicode character is encoded as 1 to 4 char. A Unicode code point needs at least 21 bits to represent. Yet OP does not need to convert a portion of string[] into a Unicode code point so much as to segment that string on UTF-8 character boundaries. These are readily found by looking for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.

The following copies one Unicode character at a time into a short UTF-8 string with its terminating null character, then prints that short string.

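    #include <stdio.h>

    int main(void) {
      const char *string = u8"ايه الاخبار";
      for (const char *s = string; *s; ) {
        printf("<");
        char u[5];        // up to 4 bytes of UTF-8 plus the null terminator
        char *p = u;
        *p++ = *s++;      // lead byte (or a single ASCII byte)
        // Copy up to 3 continuation bytes, which all look like 10xxxxxx.
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        *p = 0;
        printf("%s", u);
        printf(">\n");
      }
      return 0;
    }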

With the output viewed on a UTF-8 aware screen:

<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>
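
The same boundary test can fill an array so that each character is addressable by index, in the spirit of the letter[1], letter[2] layout from the question. A minimal sketch, assuming valid UTF-8 input; the Letters type, MAX_LETTERS and split_letters are hypothetical names chosen for illustration, and indices start at 0 as usual in C.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LETTERS 64   /* arbitrary capacity chosen for this sketch */

/* Hypothetical container: each letter is kept as its own short
   NUL-terminated UTF-8 string (at most 4 bytes plus the terminator). */
typedef struct {
    char letter[MAX_LETTERS][5];
    size_t count;
} Letters;

/* Split a NUL-terminated UTF-8 string into individual characters.
   Returns the number of letters stored; input beyond MAX_LETTERS
   characters is ignored. */
size_t split_letters(const char *s, Letters *out)
{
    assert(s != NULL && out != NULL);

    out->count = 0;
    while (*s != '\0' && out->count < MAX_LETTERS) {
        char *p = out->letter[out->count];
        size_t n = 1;

        *p++ = *s++;     /* lead byte, or a single ASCII byte */
        /* Append up to 3 continuation bytes (bit pattern 10xxxxxx). */
        while (n < 4 && ((unsigned char)*s & 0xC0) == 0x80) {
            *p++ = *s++;
            n++;
        }
        *p = '\0';
        out->count++;
    }
    return out->count;
}
```

After Letters w; split_letters(string, &w); each w.letter[i] can be printed with printf("%s", w.letter[i]) or compared with strcmp against a UTF-8 literal such as "ب".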

Answered by chux - Reinstate Monica, Sep 27 '22 18:09