I am trying to write a filter that strips special characters from a string and converts it to lowercase. For example, "Good morning!" is transformed into good morning.
I pass one string at a time to my function.
Filtering works for strings in English, but I have problems when I pass strings in my native language.
What kind of regex should I use if I want to include all UTF-8 characters?
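#include <string>
#include <iostream>
#include <regex>
#include <algorithm>
#include <cctype>

std::string process(std::string s) {
    std::string st;
    // Letters and digits, optionally joined by ', _ or -
    std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
    // Guard against dereferencing the end iterator when nothing matches
    if (i == std::sregex_iterator()) return st;
    std::smatch m = *i;
    st = m.str();
    // Cast to unsigned char: passing a negative char to tolower is undefined
    std::transform(st.begin(), st.end(), st.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return st;
}

int main() {
    std::string st = "ąžuolas!";
    std::cout << process(st) << std::endl; // <- gives: uolas
    return 0;
}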
The ?: prefix marks a group as non-capturing. In a pattern such as (?:\w+\s)(\w+), whatever is matched by (?:\w+\s) does not appear in the list of captured groups even though it is enclosed in parentheses; only (\w+) does.
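As a small illustration of the non-capturing group described above, here is a sketch using std::regex. The helper name second_word is made up for this example; the pattern is the (?:\w+\s)(\w+) form mentioned in the text.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the word captured by (\w+), i.e. the word
// that follows the first "word + whitespace" pair, or "" if nothing matches.
std::string second_word(const std::string& text) {
    // (?:\w+\s) groups without capturing; (\w+) is the only capture group.
    static const std::regex r(R"((?:\w+\s)(\w+))");
    std::smatch m;
    if (std::regex_search(text, m, r)) {
        // m[0] is the whole match, m[1] is the single capture group.
        return m[1].str();
    }
    return "";
}
```

Because the first group is non-capturing, the match result holds only two entries: the whole match and the one captured word.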
The regular expression [A-Z][a-z]* matches any sequence of letters that starts with an uppercase letter and is followed by zero or more lowercase letters.
Special Regex Characters: These characters have special meaning in regex: . , + , * , ? , ^ , $ , ( , ) , [ , ] , { , } , | , \ . Escape Sequences (\char): To match a character that has special meaning in regex literally, prefix it with a backslash ( \ ).
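The escaping rule above can be sketched as a small helper that puts a backslash in front of every special character, so the result matches the input literally. The name escape_regex is hypothetical, and the sketch assumes the default ECMAScript grammar of std::regex.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: escapes every regex metacharacter in `literal`
// so that the returned pattern matches the text literally.
std::string escape_regex(const std::string& literal) {
    static const std::string special = ".+*?^$()[]{}|\\";
    std::string out;
    for (char c : literal) {
        if (special.find(c) != std::string::npos) {
            out += '\\';  // prefix the special character with a backslash
        }
        out += c;
    }
    return out;
}
```

For instance, the pattern built from a.b matches the text a.b but no longer matches axb, because the dot has lost its "any character" meaning.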
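You can match any Unicode 'letter' character using the regex \p{L}\p{M}*. Note that \p{...} classes are not supported by std::regex; they require an engine with Unicode property support, such as PCRE or Boost.Regex built with ICU.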
Therefore, the complete regex will be:
((?:\p{L}\p{M}*)+(?:['_-](?:\p{L}\p{M}*)+)*)
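If pulling in a Unicode-aware regex engine is not an option, the same word shape (runs of letters optionally joined by ', _ or -) can be approximated by hand on the raw UTF-8 bytes. This is a different technique from the regex above, shown only as a rough sketch. It assumes the input is valid UTF-8 and treats every byte >= 0x80 as part of a letter; the name first_word_utf8 is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Assumption: any byte >= 0x80 belongs to a multi-byte UTF-8 character and
// is treated as a letter. This also accepts non-letter symbols outside ASCII.
static bool is_word_byte(unsigned char c) {
    return c >= 0x80 ||
           (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

static bool is_joiner(unsigned char c) {
    return c == '\'' || c == '_' || c == '-';
}

// Hypothetical helper: returns the first word in s with ASCII letters
// lowercased, or an empty string if s contains no word.
std::string first_word_utf8(const std::string& s) {
    std::size_t i = 0;
    // Skip leading bytes that cannot start a word.
    while (i < s.size() && !is_word_byte(s[i])) ++i;
    const std::size_t start = i;
    while (i < s.size()) {
        if (is_word_byte(s[i])) {
            ++i;
        } else if (is_joiner(s[i]) && i + 1 < s.size() && is_word_byte(s[i + 1])) {
            i += 2;  // a joiner counts only when another word byte follows
        } else {
            break;
        }
    }
    std::string out = s.substr(start, i - start);
    // Only ASCII is lowercased; multi-byte characters are left untouched.
    for (char& ch : out) {
        if (ch >= 'A' && ch <= 'Z') ch = static_cast<char>(ch - 'A' + 'a');
    }
    return out;
}
```

With this sketch the bytes of ą and ž are kept, so ąžuolas! yields ąžuolas rather than uolas. Lowercasing non-ASCII letters such as Ą would still need a Unicode library, since it cannot be done byte by byte.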