I am trying to replace every non-alphabetic character in a string with " "
using Boost:
#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

std::string sanitize(std::string &str)
{
    boost::regex re;
    re.imbue(std::locale("fr_FR.UTF-8"));
    re.assign("[^[:alpha:]]");
    str = boost::regex_replace(str, re, " ");
    return str;
}

int main()
{
    std::string test = "(ça) /.2424,@ va très bien ?";
    std::cout << sanitize(test) << std::endl;
    return 0;
}
The result is "a va tr s bien", but I would like to get "ça va très bien".
What am I missing?
Boost.Regex allows you to use regular expressions in C++. A regular expression library modeled on it has been part of the standard library since C++11, so you don’t depend on Boost.Regex if your development environment supports C++11. You can use identically named classes and functions in the namespace std if you include the header <regex>.
In addition to the search string and the regular expression, boost::regex_replace() needs a format that defines how substrings matching individual groups of the regular expression should be replaced. If the regular expression does not contain any groups, each matching substring is replaced one to one using the given format.
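To make the role of the format string concrete, here is a minimal sketch. It uses std::regex_replace from the C++11 header <regex> rather than Boost, on the assumption that the two behave the same for these simple cases. The function names mask_digits and swap_words are hypothetical and chosen only for illustration.

```cpp
#include <regex>
#include <string>

// No groups in the pattern: every match is replaced by the format text as-is.
// (mask_digits is a hypothetical helper, not part of any library.)
std::string mask_digits(const std::string &s)
{
    static const std::regex re("[0-9]+");
    return std::regex_replace(s, re, "#");
}

// Two groups in the pattern: $1 and $2 in the format refer to the
// substrings captured by the first and second parenthesized group.
// (swap_words is a hypothetical helper, not part of any library.)
std::string swap_words(const std::string &s)
{
    static const std::regex re("(\\w+) (\\w+)");
    return std::regex_replace(s, re, "$2 $1");
}
```

Calling mask_digits("a12b3") should give "a#b#", and swap_words("hello world") should give "world hello".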
Locale names are platform-dependent: a locale that Windows calls “Turkish” is called “tr_TR” on a POSIX operating system, and the locale must also be installed there. Note that boost::regex is defined with a platform-dependent second template parameter. On Windows this parameter is boost::w32_regex_traits, which allows an LCID to be passed to imbue().
boost::regex::imbue doesn't do what you are hoping for here: in particular, it won't make boost::regex work with UTF-8. (You could probably make it work this way with ISO 8859-1 or a similar single-byte character encoding, but that doesn't seem to be what you want here.)
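To see why a byte-oriented regex cannot treat "ç" as a letter, it may help to look at the bytes themselves. The sketch below is not Boost code: replace_non_alpha_bytes is a hypothetical helper that roughly mimics what a char-based [^[:alpha:]] replacement does, assuming the string is UTF-8 encoded and the default "C" locale is in effect.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: replaces every byte that std::isalpha rejects with a
// space, roughly what a char-based [^[:alpha:]] replacement does.
std::string replace_non_alpha_bytes(std::string s)
{
    for (char &c : s) {
        // Cast to unsigned char: passing a negative char to isalpha is undefined.
        if (!std::isalpha(static_cast<unsigned char>(c))) {
            c = ' ';
        }
    }
    return s;
}

// "ça" in UTF-8: 'ç' is the two bytes 0xC3 0xA7, followed by ASCII 'a'.
// The literal is split so that 'a' is not parsed as part of the hex escape.
const std::string kCedillaA = "\xC3\xA7" "a";
```

Under those assumptions, kCedillaA.size() is 3 rather than 2, and replace_non_alpha_bytes(kCedillaA) turns both bytes of ç into spaces and keeps only the a, which is the same kind of damage seen in the question's output.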
For UTF-8 support, you will need to use one of the boost::regex classes which will deal with Unicode - see http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/unicode.html.
Here is some code which I think does what you want:
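#include <iostream>
#include <string>
#include <boost/regex/icu.hpp>

// Replaces every non-alphabetic code point in the UTF-8 string with a space.
std::string sanitize(std::string &str)
{
    // u32regex matches on Unicode code points, so [:alpha:] covers ç and è.
    boost::u32regex re = boost::make_u32regex("[^[:alpha:]]");
    str = boost::u32regex_replace(str, re, " ");
    return str;
}

int main()
{
    std::string test = "(ça) /.2424,@ va très bien ?";
    // Print the original first; sanitize() modifies its argument in place.
    std::cout << test << "\n";
    std::cout << sanitize(test) << std::endl;
    return 0;
}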
See http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu/unicode_algo.html for more examples.