How to remove accents and tilde in a C++ std::string

I have a problem with a string in C++ that contains several words in Spanish, so it has a lot of accents and tildes. I want to replace the accented letters with their unaccented counterparts. For example, I want to replace "había" with "habia". I tried to do this directly with the replace method of the string class, but I could not get that to work.

I'm using this code:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find_first_of(strMine);
    while (found!=std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        found=toReplace.find_first_of(strMine,found+1);
    }
}

Where dictionary is a map like this (with more entries):

dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );

and the toReplace string is:

std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";

I must obviously be missing something, but I can't figure it out. Is there any library I can use?

Thanks,

Alejo asked Sep 27 '08 23:09



2 Answers

I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)

Now, the best algorithm is hinted at in the approved answer: use NFKD (decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.

There is little point in recomposing afterwards, though. You have removed most of the sequences that would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?

MSalters answered Sep 17 '22 13:09


First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.

What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.

That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter

NFD; [:M:] remove; NFC

as “Compound 1” and click transform.

(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)

andrewdotn answered Sep 20 '22 13:09