
Remove letter accents from a given text

Maybe I'm missing something obvious, but is there a "painless" way to replace the accented letters in a given text with their unaccented counterparts? I can only use the standard ANSI C libraries/headers, so my hands are tied. What I've tried so far:

unsigned char currentChar;

(...)

if (currentChar == 'à') { 
    currentChar = 'a'; 
}
else if (currentChar == 'è' || currentChar == 'é') {
    currentChar = 'e'; 
}
else if (...)

However, this doesn't work. Detecting accented vowels with their extended ASCII value isn't an option, either, as I've noticed that it changes depending upon the system locale.

Any hints/suggestions?

(update)

Thanks for the answers, but I'm not really asking for the best approach for this problem - I'll think about it later. I'm simply asking for a way to detect the accented vowels, as the code above simply ignores them.

(update #2)

Okay. Let me clarify:

#include <stdio.h>

int main(void) {
    int i;
    char vowels[6] = {'à','è','é','ì','ò','ù'};
    for (i = 0; i < 6; i++) {
        switch (vowels[i]) {
            case 'à': vowels[i] = 'a'; break;
            case 'è': vowels[i] = 'e'; break;
            case 'é': vowels[i] = 'e'; break;
            case 'ì': vowels[i] = 'i'; break;
            case 'ò': vowels[i] = 'o'; break;
            case 'ù': vowels[i] = 'u'; break;
        }
    }
    printf("\n");
    for (i = 0; i < 6; i++) {
        printf("%c", vowels[i]);
    }
    printf("\n");
    return 0;
}

This code still prints "àèéìòù" as its output. This is my problem. I appreciate the answers, however it's pointless to tell me to implement a conversion map, or a switch/case structure. I'll think about it later.

Nancy B. asked Oct 25 '25 14:10


2 Answers

The accented characters are probably stored in UTF-8 or some other multibyte encoding. Your program uses the char type, which holds a single byte and is usually only reliable for the ASCII character set.

In ASCII, each character is represented by a single byte, and the set does not include any accented characters.

Other encodings do include them, but an accented character is probably not represented by a single byte there, so code that compares one char at a time cannot match it. The usual solution is to use wide characters.
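To make the single-byte point concrete, here is a minimal sketch, assuming the text is UTF-8 (the question does not say which encoding is in use). The names `E_ACUTE_UTF8` and `count_byte` are made up for illustration. The string is spelled with hex escapes so the example does not depend on how the source file itself is saved.

```c
#include <assert.h>
#include <stddef.h>

/* "é" (U+00E9) encoded as UTF-8: two bytes, 0xC3 followed by 0xA9.
   Assumption: the input text is UTF-8. */
static const char E_ACUTE_UTF8[] = "\xC3\xA9";

/* Count how many bytes of s are equal to the single byte value needle. */
static size_t count_byte(const char *s, unsigned char needle)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s == needle)
            n++;
    }
    return n;
}
```

Under that assumption, `sizeof E_ACUTE_UTF8 - 1` is 2, and `count_byte(E_ACUTE_UTF8, 0xE9)` is 0: the byte value that a single-byte encoding such as Latin-1 uses for "é" never appears, which is why a test against one char value cannot succeed.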

This question may have a more general explanation.

This question may provide a solution for your case.

This code seems to do what you would like:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char **argv){
    FILE *f;
    wint_t c; /* wint_t rather than wchar_t, so that WEOF can be represented */

    if (argc < 2)
        return 1;

    /* Use the environment's locale so fgetwc() decodes the file's multibyte encoding */
    setlocale(LC_CTYPE, "");
    f = fopen(argv[1], "r");
    if (!f)
        return 1;

    while ((c = fgetwc(f)) != WEOF){
        switch (c) {
            case L'à': c=L'a'; break;
            case L'è': c=L'e'; break;
            case L'é': c=L'e'; break;
            case L'ì': c=L'i'; break;
            case L'ò': c=L'o'; break;
            case L'ù': c=L'u'; break;
            default:   break;
        }
        wprintf(L"%lc", c);
    }

    fclose(f);
    return 0;
}
Richard answered Oct 27 '25 02:10


There may be an easier way, some existing functionality that I haven't heard of, but as far as structure goes, this is how I'd approach it:

Build a table of character conversions, each entry pairing an accented character with its unaccented replacement. Then write a simple loop that looks up each character of the text in the table and, if it is found, makes the change.
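A rough sketch of that table-and-loop idea, for illustration only. It works on wide characters and assumes that wchar_t values are Unicode code points (true where the implementation defines `__STDC_ISO_10646__`, for example with glibc, but not guaranteed by the C standard). The names `accent_map`, `ACCENT_TABLE`, `strip_accent` and `strip_accents` are hypothetical, and the table only covers the six vowels from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* One table row: an accented character and its plain replacement. */
struct accent_map {
    wchar_t accented;
    wchar_t plain;
};

/* Assumption: wchar_t holds Unicode code points, so the accented
   letters can be written as numeric constants. */
static const struct accent_map ACCENT_TABLE[] = {
    { 0x00E0, L'a' }, /* a with grave */
    { 0x00E8, L'e' }, /* e with grave */
    { 0x00E9, L'e' }, /* e with acute */
    { 0x00EC, L'i' }, /* i with grave */
    { 0x00F2, L'o' }, /* o with grave */
    { 0x00F9, L'u' }  /* u with grave */
};

/* Return the replacement for c if it is in the table, else c unchanged. */
static wchar_t strip_accent(wchar_t c)
{
    size_t i;
    for (i = 0; i < sizeof ACCENT_TABLE / sizeof ACCENT_TABLE[0]; i++) {
        if (ACCENT_TABLE[i].accented == c)
            return ACCENT_TABLE[i].plain;
    }
    return c;
}

/* Replace every accented character of a wide string in place. */
static void strip_accents(wchar_t *s)
{
    assert(s != NULL);
    for (; *s != L'\0'; s++)
        *s = strip_accent(*s);
}
```

A linear scan is fine for a handful of entries; with a much larger table, a sorted array searched with bsearch() would be one option. The text still has to be decoded into wide characters first (for example with fgetwc() after setlocale()), otherwise the lookup sees individual bytes and finds nothing.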

Jonathan Wood answered Oct 27 '25 02:10


