How to remove accents and tilde in a C++ std::string

Tags: c++, string, text, str-replace

I have a problem with a string in C++ which has several words in Spanish. This means that I have a lot of words with accents and tildes. I want to replace them with their unaccented counterparts. Example: I want to replace the word "había" with "habia". I tried to replace it directly with the replace method of the string class, but I could not get that to work.

I'm using this code:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find_first_of(strMine);
    while (found!=std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        found=toReplace.find_first_of(strMine,found+1);
    }
}

Where dictionary is a map like this (with more entries):

dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );

and the toReplace string is:

std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";

I obviously must be missing something, but I can't figure it out. Is there any library I can use?

Thanks,


asked Sep 27 '08 23:09

Alejo

2 Answers

I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that the uppercase of i is İ in Turkish? That's why you ignore accents).

Now, the best algorithm is hinted at in the approved answer: use NFKD (decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.

There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?
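To make the "remove all accents" step concrete, here is a minimal sketch, not a complete solution. It assumes the input is valid UTF-8 that has already been decomposed (for example by ICU); it does no normalization itself. It also only drops the Combining Diacritical Marks block, U+0300 to U+036F, which covers the Spanish accents and the tilde but is only a small part of the full [:M:] category. The name strip_combining_marks is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes code points U+0300..U+036F (Combining
// Diacritical Marks) from a UTF-8 string.
// ASSUMPTION: the input is valid UTF-8 and is already decomposed (NFD/NFKD).
// Precomposed letters such as U+00F1 are left untouched.
std::string strip_combining_marks(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (i + 1 < utf8.size()) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            // U+0300..U+033F encode as the bytes CC 80 .. CC BF,
            // U+0340..U+036F encode as the bytes CD 80 .. CD AF.
            const bool is_mark =
                (lead == 0xCC && next >= 0x80 && next <= 0xBF) ||
                (lead == 0xCD && next >= 0x80 && next <= 0xAF);
            if (is_mark) {
                ++i;       // also skip the second byte of the mark
                continue;
            }
        }
        out.push_back(utf8[i]);   // copy every other byte unchanged
    }
    return out;
}
```

Fed a decomposed "había" (an i followed by U+0301), this returns plain "habia". A precomposed í passes through unchanged, which is why the decomposition step has to come first. The loop in the question trips over the same multi-byte encoding: find_first_of matches any single byte of "á", whereas find matches the whole sequence.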


answered Sep 17 '22 13:09

MSalters

First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.

What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.

That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter

NFD; [:M:] remove; NFC

as “Compound 1” and click transform.

(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)

answered Sep 20 '22 13:09

andrewdotn
