I have a problem with a C++ string that contains several words in Spanish, which means a lot of words with accents and tildes. I want to replace them with their non-accented counterparts. Example: I want to replace the word "había" with "habia". I tried to do the replacement directly with the replace method of the string class, but I could not get that to work.
I'm using this code:
for (it = dictionary.begin(); it != dictionary.end(); it++)
{
    strMine = (it->first);
    found = toReplace.find_first_of(strMine);
    while (found != std::string::npos)
    {
        strAux = (it->second);
        toReplace.erase(found, strMine.length());
        toReplace.insert(found, strAux);
        found = toReplace.find_first_of(strMine, found + 1);
    }
}
Where dictionary is a map like this (with more entries):
dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );
and the toReplace string is:
std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";
I obviously must be missing something, but I can't figure it out. Is there any library I can use?
Thanks,
In C++ we can do this kind of task very easily using the erase() and remove() functions. std::remove takes the begin and end iterators of the string and the character to be removed; it shifts the remaining characters to the front and returns an iterator to the new logical end, which erase() then uses to trim the string.
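For instance, a minimal sketch of the erase-remove idiom (the sample string and character here are just placeholders; note that it works on single chars, so on its own it won't handle multi-byte UTF-8 sequences like "á"):
#include <algorithm>
#include <iostream>
#include <string>

int main()
{
    std::string s = "habia un nino";
    // std::remove shifts every character that is not 'n' to the front and
    // returns an iterator to the new logical end; erase() trims the tail.
    s.erase(std::remove(s.begin(), s.end(), 'n'), s.end());
    std::cout << s << std::endl;   // prints "habia u io"
}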
Remove a character from a string using std::erase() in C++20: C++20 introduced a new STL algorithm, std::erase(container, element), to delete all occurrences of an element from a container. It accepts two arguments: the STL container from which we need to delete elements, and the value of the element to be deleted.
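A quick sketch of that C++20 form (again, the string and character are placeholders):
#include <iostream>
#include <string>

int main()
{
    std::string s = "banana";
    // C++20: std::erase removes every occurrence of the value from the
    // string and returns the number of characters removed.
    auto removed = std::erase(s, 'a');
    std::cout << s << " (" << removed << " removed)" << std::endl;   // prints "bnn (3 removed)"
}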
Use java.text.Normalizer to handle this for you. This will separate all of the accent marks from the characters.
I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)
Now, the best algorithm is hinted at in the approved answer: use NFD (decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.
There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFC and æ in NFD?
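For what it's worth, here is a rough sketch of that decompose-and-strip approach in C++ using ICU's Normalizer2 (the stripAccents helper name and the sample text are my own; it assumes ICU4C is installed and linked):
#include <unicode/normalizer2.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

// Decompose to NFD, drop all combining marks (category M), skip re-composition.
std::string stripAccents(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
    if (U_FAILURE(status)) return utf8;

    icu::UnicodeString decomposed =
        nfd->normalize(icu::UnicodeString::fromUTF8(utf8), status);
    if (U_FAILURE(status)) return utf8;

    icu::UnicodeString filtered;
    for (int32_t i = 0; i < decomposed.length(); )
    {
        UChar32 c = decomposed.char32At(i);
        if (!(U_GET_GC_MASK(c) & U_GC_M_MASK))   // keep everything that is not a mark
            filtered.append(c);
        i += U16_LENGTH(c);
    }

    std::string out;
    filtered.toUTF8String(out);
    return out;
}

int main()
{
    std::cout << stripAccents("había un niño") << std::endl;   // prints "habia un nino"
}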
First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.
What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.
That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter
NFD; [:M:] remove; NFC
as “Compound 1” and click transform.
(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)
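If you want to apply the same transform programmatically from C++, a minimal sketch with ICU4C's Transliterator API could look like this (variable names and sample text are placeholders; it assumes you link against the ICU libraries, e.g. icuuc and icui18n):
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    // Build a transliterator from the same compound rule used in the demo:
    // decompose, remove all combining marks, re-compose.
    icu::Transliterator* removeAccents = icu::Transliterator::createInstance(
        icu::UnicodeString("NFD; [:M:] remove; NFC"), UTRANS_FORWARD, status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString text = icu::UnicodeString::fromUTF8("á-é-í-ó-ú-ñ había");
    removeAccents->transliterate(text);   // modifies text in place

    std::string result;
    text.toUTF8String(result);
    std::cout << result << std::endl;     // prints "a-e-i-o-u-n habia"

    delete removeAccents;
}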