Remove letter accents from a given text

Question

Maybe I'm missing something obvious, but is there a "painless" way to replace the accented letters in a given text with their unaccented counterparts? I can only use the standard ANSI C libraries/headers, so my hands are tied. What I've tried so far:

unsigned char currentChar;

(...)

if (currentChar == 'à') { 
    currentChar = 'a'; 
}
else if (currentChar == 'è' || currentChar == 'é') {
    currentChar = 'e'; 
}
else if (...)

However, this doesn't work. Detecting accented vowels with their extended ASCII value isn't an option, either, as I've noticed that it changes depending upon the system locale.

Any hints/suggestions?

(update)

Thanks for the answers, but I'm not really asking for the best approach for this problem - I'll think about it later. I'm simply asking for a way to detect the accented vowels, as the code above simply ignores them.

(update #2)

Okay. Let me clarify:

#include <stdio.h>

int main(void) {
    int i;
    char vowels[6] = {'à','è','é','ì','ò','ù'};
    for (i = 0; i < 6; i++) {
        switch (vowels[i]) {
            case 'à': vowels[i] = 'a'; break;
            case 'è': vowels[i] = 'e'; break;
            case 'é': vowels[i] = 'e'; break;
            case 'ì': vowels[i] = 'i'; break;
            case 'ò': vowels[i] = 'o'; break;
            case 'ù': vowels[i] = 'u'; break;
        }
     }
     printf("\n");
     for (i = 0; i < 6; i++) {
         printf("%c",vowels[i]);
     }
     printf("\n");
     return 0;
}

This code still prints "àèéìòù" as its output. This is my problem. I appreciate the answers, however it's pointless to tell me to implement a conversion map, or a switch/case structure. I'll think about it later.

Richard · Accepted Answer

The accented characters are most likely stored in UTF-8 or some other non-ASCII encoding. Your program uses the char type, which holds a single byte.

ASCII represents each character with a single byte and defines only 128 of them; none is an accented letter.

Other encodings do include accented letters, but often not as a single byte. In UTF-8, for example, à is the two-byte sequence 0xC3 0xA0, so it cannot fit in one char, and a constant such as 'à' in a UTF-8 source file is a multi-character constant whose value is implementation-defined. That is most likely why your comparisons never match. The usual solution is to use wide characters (wchar_t).

This question may have a more general explanation.

This question may provide a solution for your case.
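To make the "more than one byte" point concrete, here is a minimal sketch that reports how many bytes a UTF-8 sequence occupies, judging only by its first byte. The function name utf8_char_len is hypothetical, it assumes the text really is UTF-8, and it does not validate the continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes in the UTF-8 sequence
   that starts with `lead`, or 0 if `lead` cannot start a sequence
   (for example, a continuation byte). It only inspects the lead byte;
   it does not check the bytes that follow. */
size_t utf8_char_len(unsigned char lead)
{
    if (lead < 0x80)           return 1; /* 0xxxxxxx: plain ASCII   */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: e.g. accented */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx                */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx                */
    return 0;                            /* not a valid lead byte   */
}
```

For the letter à, encoded in UTF-8 as 0xC3 0xA0, utf8_char_len(0xC3) returns 2, whereas utf8_char_len('a') returns 1. A single char can therefore hold only half of the accented letter.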

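This code seems to do what you would like:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char **argv) {
    FILE *f;
    wint_t c;   /* wint_t rather than wchar_t, so that WEOF is representable */

    /* Use the environment's locale so multibyte input is decoded correctly. */
    setlocale(LC_CTYPE, "");

    if (argc < 2)
        return 1;
    f = fopen(argv[1], "r");
    if (!f)
        return 1;

    /* Read one wide character at a time and replace the accented vowels. */
    while ((c = fgetwc(f)) != WEOF) {
        switch (c) {
            case L'à': c = L'a'; break;
            case L'è': c = L'e'; break;
            case L'é': c = L'e'; break;
            case L'ì': c = L'i'; break;
            case L'ò': c = L'o'; break;
            case L'ù': c = L'u'; break;
            default:   break;
        }
        wprintf(L"%lc", c);
    }

    fclose(f);
    return 0;
}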

Jonathan Wood · Answer

There may be an easier way, some existing functionality that I haven't heard of, but as far as structure goes, this is how I'd approach it:

Build a table of character conversions, each entry pairing an accented character with its unaccented replacement. Then write a simple loop that looks up each character of the text in the table and, if it is found, makes the substitution.
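The table-and-loop idea could be sketched roughly as follows. This is only an illustration: the names accent_map, accent_table and strip_accent are made up, the table covers just the six vowels from the question, and it assumes wchar_t values are Unicode code points (true on most current platforms, but not guaranteed by the C standard).

```c
#include <stddef.h>
#include <wchar.h>

/* One table entry: an accented character and its plain replacement. */
struct accent_map {
    wchar_t accented;
    wchar_t plain;
};

/* Assumption: wchar_t holds Unicode code points, so the accented letters
   are written as numeric values instead of relying on the source encoding. */
static const struct accent_map accent_table[] = {
    { 0x00E0, L'a' },  /* à */
    { 0x00E8, L'e' },  /* è */
    { 0x00E9, L'e' },  /* é */
    { 0x00EC, L'i' },  /* ì */
    { 0x00F2, L'o' },  /* ò */
    { 0x00F9, L'u' }   /* ù */
};

/* Scan the table; return the replacement if c is listed, else c unchanged. */
wchar_t strip_accent(wchar_t c)
{
    size_t i;
    size_t n = sizeof accent_table / sizeof accent_table[0];

    for (i = 0; i < n; i++) {
        if (accent_table[i].accented == c)
            return accent_table[i].plain;
    }
    return c;
}
```

Calling strip_accent on each wide character read from the input would take the place of a long if/else or switch chain, and supporting more letters only means adding rows to the table.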

Tags: c, character-encoding, ascii, non-ascii-characters

Nancy B.
