
How to do operations with 'æ', 'ø' and 'å' in C

I have made a program in C which can either replace or remove all vowels from a string. In addition, I would like it to work for these characters: 'æ', 'ø', 'å'.

I have tried to use strstr(), but I didn't manage to implement it without replacing all chars on the line containing 'æ', 'ø' or 'å'. I have also read about wchar, but that only seems to complicate everything.

The program is working with this array of chars:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

but it gives these warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with 'a' it replaces 'å' with "�a".

I have also tried with the UTF-8 hexval of 'æ', 'ø' and 'å'.

char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

but it gives this error:

excess elements in char array initializer

Is there a way to make this work without making it too complicated?

Martin Johansen asked Sep 21 '15 12:09


2 Answers

There are two approaches to making those characters usable. The first is code pages, which would allow you to use extended ASCII characters (values 128-255), but the code page is system and locale dependent, so it's a bad idea in general.

The better alternative is to use Unicode. The typical case with Unicode is to use wide character literals, as in this example:

wchar_t str[] = L"αγρω";

The key problem with your code is that you're comparing single-byte ASCII char values with UTF-8 text, in which 'æ', 'ø' and 'å' each take more than one byte. The solution is simple: convert all your literals, as well as your strings, to wide characters. You need to work with one common encoding rather than mixing encodings, unless you have conversion functions to help out.
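As a rough illustration of the wide character approach, here is a minimal sketch. It assumes the text is already held in a wchar_t array (text read as bytes would first need converting, for instance with mbstowcs() under a suitable locale). The names WIDE_VOWELS and replace_vowels_wide are made up for this example and are not from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Vowels as one wide string; \u00e6, \u00f8 and \u00e5 are the universal
   character names for ae, o-with-stroke and a-with-ring. */
static const wchar_t WIDE_VOWELS[] = L"aeiouy\u00e6\u00f8\u00e5";

/* Hypothetical helper: overwrite every lowercase vowel in s with replacement.
   Each wchar_t element is compared as a whole character, so there is no
   multi-byte sequence to split. */
void replace_vowels_wide(wchar_t *s, wchar_t replacement) {
    assert(s != NULL);
    for (; *s != L'\0'; ++s) {
        if (wcschr(WIDE_VOWELS, *s) != NULL) {
            *s = replacement;
        }
    }
}
```

Calling replace_vowels_wide(buf, L'a') on a buffer holding L"bl\u00e5b\u00e6r" leaves six wide characters in it, with both non-ASCII vowels replaced.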

Cloud answered Nov 03 '22 19:11


Learn about UTF-8 (including its relationship to Unicode) and use some UTF-8 library: libunistring, utfcpp, GLib from GTK, ICU ....

You need to understand which character encoding you are using.

I strongly recommend UTF-8 in all cases (which is the default on most Linux systems and nearly all the Internet and web servers; read locale(7) & utf8(7)). Read utf8everywhere....

I don't recommend wchar_t, whose width, range and signedness are implementation-specific (you can't be sure that every Unicode code point fits in a wchar_t; on Windows it is 16 bits wide, so it does not). Also, converting UTF-8 input to Unicode/UCS4 can be time-consuming, more so than handling UTF-8 directly...

Do understand that in UTF-8 a character can be encoded in several bytes. For example ê (French lower-case e with a circumflex accent) is encoded in the two bytes 0xc3, 0xaa, and ы (Russian lower-case yery) is encoded in the two bytes 0xd1, 0x8b. Both are considered vowels, but neither fits in one char (which is an 8-bit byte on your machine and mine).

The notion of a vowel is complicated (e.g. what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ....), so there might be no simple solution to your problem (all the more since Unicode has combining characters).
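To make the byte counts above concrete, here is a small sketch in plain C. It assumes the input is valid UTF-8 and makes no attempt to reject overlong or otherwise malformed sequences; the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence that starts with lead, or 0 if
   lead cannot start a sequence (a continuation byte or an invalid byte). */
size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Counts code points by counting every byte that is not a continuation
   byte (10xxxxxx). Only meaningful for valid UTF-8 input. */
size_t utf8_count_code_points(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "\xc3\xaa" (ê), strlen() reports 2 bytes while utf8_count_code_points() reports 1 character, which is why a single char cannot hold it.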

Are you sure that æ and œ are letters or vowels? (FWIW, å & œ & æ are classified as lowercase letters in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't treat them as letters, so œuf is in a dictionary at the place of oeuf, which means egg). But I am not an expert on this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on recent distributions), I don't recommend using wchar_t. Use UTF-8 in char instead (so, functions handling multi-byte encoded UTF-8), for example (using GLib UTF-8 & Unicode functions):

  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <glib.h>

  unsigned count_norwegian_lowercase_vowels(const char *s) {
    assert(s != NULL);
    // s should be a not-too-big string
    // (its `strlen` should be less than UINT_MAX)
    // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
    if (!g_utf8_validate(s, -1, NULL)) {
      fprintf(stderr, "invalid UTF-8 string %s\n", s);
      exit(EXIT_FAILURE);
    }
    unsigned count = 0;
    // advance one UTF-8 encoded character (not one byte) per iteration
    for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
      gunichar u = g_utf8_get_char(pc); // decode the code point at pc
      // comments from OP make me believe these are the only Norwegian vowels.
      if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
          || u == (gunichar)0xe6 // æ U+00E6 LATIN SMALL LETTER AE
          || u == (gunichar)0xf8 // ø U+00F8 LATIN SMALL LETTER O WITH STROKE
          || u == (gunichar)0xe5 // å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
          /* notice that for me ы & ê are also vowels but œ is a ligature ... */
      )
        count++;
    }
    return count;
  }

I'm not sure the name of my function is correct, but you told me in comments that Norwegian (which I don't know) has no more vowel characters than what my function is counting.

It is on purpose that I did not put UTF-8 in literal strings or wide char literals (only in comments). There are other obsolete character encodings (read about EBCDIC or KOI8) and you might want to cross-compile the code.
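If pulling in a library is too much for the original replace-or-remove task, the same idea can be sketched in plain C by matching the two-byte UTF-8 sequences directly. This is only a sketch under stated assumptions: the input is valid UTF-8, only the nine lowercase vowels from the question count, and the helper name replace_vowels_utf8 plus the convention that a '\0' replacement means "remove" are inventions of this example. Note that the hex-escaped vowels go in an array of strings (const char *), not an array of char, which is what the "excess elements" error in the question was about.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encodings of ae, o-with-stroke and a-with-ring as hex escapes.
   Each entry is a two-byte string, so this is an array of pointers. */
static const char *const EXTENDED_VOWELS[] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};
static const char ASCII_VOWELS[] = "aeiouy";

/* Hypothetical helper: rewrites s in place. Every lowercase vowel becomes
   replacement, or is dropped when replacement is '\0'. The result never
   grows, because a two-byte vowel shrinks to at most one byte. */
void replace_vowels_utf8(char *s, char replacement) {
    assert(s != NULL);
    const char *in = s;  /* read position */
    char *out = s;       /* write position, never ahead of in */
    while (*in != '\0') {
        size_t matched = 0;  /* bytes used by a vowel at in, 0 if none */
        if (strchr(ASCII_VOWELS, *in) != NULL) {
            matched = 1;
        } else {
            size_t n = sizeof EXTENDED_VOWELS / sizeof EXTENDED_VOWELS[0];
            for (size_t i = 0; i < n; ++i) {
                if (strncmp(in, EXTENDED_VOWELS[i], 2) == 0) {
                    matched = 2;
                    break;
                }
            }
        }
        if (matched > 0) {
            if (replacement != '\0') {
                *out++ = replacement;
            }
            in += matched;   /* skip the whole vowel, both bytes if needed */
        } else {
            *out++ = *in++;  /* copy any other byte unchanged */
        }
    }
    *out = '\0';
}
```

Because both bytes of 'å' are consumed together, replacing with 'a' no longer leaves a stray byte in front of the replacement, which is where the "�a" output came from.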

Basile Starynkevitch answered Nov 03 '22 18:11