
Boost regex: [:alpha:] and accented characters

I am trying to replace every non-alphabetic character in a string with " " using Boost:

#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

std::string sanitize(std::string &str)
{
    boost::regex re;
    re.imbue(std::locale("fr_FR.UTF-8"));
    re.assign("[^[:alpha:]]");
    str = boost::regex_replace(str, re, " ");
    return str;
}


int main ()
{
    std::string test = "(ça) /.2424,@ va très bien ?";
    std::cout << sanitize(test) << std::endl;
    return 0;
}

The result is "a va tr s bien", but I would like to get "ça va très bien".

What am I missing?

Asked by Nicolas, Feb 24 '14 13:02



1 Answer

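boost::regex::imbue doesn't do what you are hoping for here. In particular, it won't make boost::regex work with UTF-8: boost::regex matches one char at a time, so the two bytes that encode "ç" or "è" are tested separately, and neither of them counts as [:alpha:]. (You could probably make it work this way with ISO 8859-1 or a similar single-byte character encoding, but that doesn't seem to be what you want here.)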

For UTF-8 support, you will need to use one of the Boost.Regex types that deal with Unicode, such as boost::u32regex. These rely on ICU, so Boost.Regex has to be built with ICU support. See http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/unicode.html.

Here is some code which I think does what you want:

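#include <iostream>
#include <string>
#include <boost/regex/icu.hpp>

// Returns a copy of str with every non-alphabetic code point replaced by a space.
// Taking the input by const reference keeps the caller's string unchanged, so
// printing test and sanitize(test) in one statement is safe.
std::string sanitize(const std::string &str)
{
    // u32regex treats the UTF-8 input as Unicode code points, not single bytes.
    boost::u32regex re = boost::make_u32regex("[^[:alpha:]]");
    return boost::u32regex_replace(str, re, " ");
}


int main ()
{
    std::string test = "(ça) /.2424,@ va très bien ?";
    std::cout << test << "\n" << sanitize(test) << std::endl;
    return 0;
}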

See http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu/unicode_algo.html for more examples.

Answered by richvdh, Sep 30 '22 16:09