
How to compress Non-ASCII characters to 1 byte in C for Linux?

I have a list of Turkish words and I need to compare their lengths. But since some Turkish characters are non-ASCII, I can't compare the lengths correctly: each non-ASCII Turkish character takes 2 bytes.

For example:

#include <stdio.h>
#include <string.h>

int main()
{
    char s1[] = "ab";
    char s2[] = "çş";

    printf("%zu\n", strlen(s1)); // prints 2 (strlen returns size_t, so use %zu)
    printf("%zu\n", strlen(s2)); // prints 4 when the source file is saved as UTF-8

    return 0;
}

My friend said it's possible to do that in Windows with the line of code below:

system("chcp 1254");

He said that it maps the Turkish characters into the extended ASCII table (code page 1254), so each one fits in a single byte. However, it doesn't work on Linux.

Is there a way to do that in Linux?

Atreidex asked Mar 08 '23 10:03


1 Answer

It's 2017 and soon 2018. So use UTF-8 everywhere (on recent Linux distributions, UTF-8 is the most common encoding for most locale(7)-s, and very likely the default on your system). Of course, a Unicode character encoded in UTF-8 takes one to four bytes, so the number of Unicode characters in a UTF-8 string is not given by strlen, which counts bytes. Consider using a UTF-8 library, like libunistring (or others, e.g. the UTF-8 functions in GLib).
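For the length comparison in the question, here is a minimal sketch of counting characters instead of bytes. It is not taken from libunistring or GLib; utf8_length is a hypothetical helper name, and it assumes the input is valid UTF-8. It works because every byte of the form 10xxxxxx is a continuation byte, so counting all the other bytes counts the code points.

```c
#include <assert.h>
#include <stddef.h>

/* Count the Unicode code points in a NUL-terminated UTF-8 string.
 * Hypothetical helper for illustration; assumes the input is valid UTF-8.
 * Bytes of the form 10xxxxxx are continuation bytes and are not counted. */
size_t utf8_length(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)   /* lead byte or plain ASCII byte */
            count++;
    }
    return count;
}
```

With this helper, utf8_length("ab") and utf8_length("çş") both give 2 when the source file is saved as UTF-8, while strlen reports 2 and 4. Note that it counts code points, so a letter written as a base character plus a combining mark would count as two; a library such as libunistring is the better choice if that matters.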

The chcp 1254 thing is Windows-specific (it switches the console code page) and is irrelevant on UTF-8 systems. So forget about it.

If you code a GUI application, use a widget toolkit like GTK or Qt. Both handle Unicode and are able to accept (or convert to) UTF-8. Notice that even simply displaying Unicode (e.g. some UTF-8 or UTF-16 string) is non-trivial, because a string could mix e.g. Arabic, Japanese, Cyrillic and English words (which you need to display in both left-to-right and right-to-left directions), so better find a library (or other tool, e.g. a UTF-8 capable terminal emulator) to do that.

If you happen to get some file, you need to know the encoding it uses (and that is only a convention that you need to find out and follow). In some cases, the file(1) command might help you guess that encoding, but you still need to understand the encoding convention used to make that file. If it is not UTF-8 encoded, you can convert it (provided you know the source encoding), perhaps with the iconv(1) command.
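If you would rather do the conversion inside your C program than with the iconv(1) command, the iconv(3) functions do the same job. The following is only a sketch: to_utf8 is a hypothetical helper name, the caller is assumed to know the source encoding (for Turkish text that could be "ISO-8859-9"), and which encoding names are accepted depends on the iconv implementation installed on your system.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from the encoding `from_enc` to
 * UTF-8, writing the result (NUL-terminated) into `out`.
 * Hypothetical helper for illustration.
 * Returns the number of bytes written, excluding the terminator,
 * or (size_t)-1 on error (unknown encoding, invalid input, buffer too small). */
size_t to_utf8(const char *from_enc, const char *in, char *out, size_t out_size)
{
    assert(from_enc != NULL && in != NULL && out != NULL && out_size > 0);

    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;          /* iconv() wants char **; the input bytes are not modified */
    size_t src_left = strlen(in);
    char *dst = out;
    size_t dst_left = out_size - 1;  /* keep one byte for the terminator */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *dst = '\0';
    return (size_t)(dst - out);
}
```

For example, the two ISO-8859-9 bytes 0xE7 0xFE ("çş") should come out as the four UTF-8 bytes 0xC3 0xA7 0xC5 0x9F. On glibc the iconv functions are part of libc; on some other systems you may need to link with -liconv.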

Basile Starynkevitch answered Apr 01 '23 14:04