Filtering string using regex in utf8 format

I am trying to write a filter that strips special characters from a string and converts it to lowercase. For example, "Good morning!" becomes good morning.
I pass one string at a time to my function.
The filter works for strings in English, but it fails when I pass strings in my native language.
What regex should I use so that all UTF-8 characters are included?

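#include <algorithm>
#include <cctype>
#include <iostream>
#include <regex>
#include <string>

// Returns the first word-like token in s, lowercased.
std::string process(const std::string& s) {
    // Runs of word characters (excluding '_'), optionally joined by ', _ or -.
    static const std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::sregex_iterator i(s.begin(), s.end(), r);
    if (i == std::sregex_iterator()) {
        return {};  // no match: dereferencing the end iterator is undefined behavior
    }
    std::string st = i->str();
    // Go through unsigned char: passing a negative char to tolower is undefined behavior.
    std::transform(st.begin(), st.end(), st.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return st;
}

int main() {
    std::string st = "ąžuolas!";
    std::cout << process(st) << std::endl; // <- gives: uolas
    return 0;
}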
asked May 21 '19 07:05 by dqmis



1 Answer

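You can match any Unicode 'letter' character, together with any combining marks that follow it, using the regex \p{L}\p{M}*. Note that this needs a regex engine with Unicode property support (for example PCRE, or Boost.Regex built with ICU); std::regex does not recognize \p{...}.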

Therefore, the complete regex will be:

((?:\p{L}\p{M}*)+(?:['_-](?:\p{L}\p{M}*)+)*)
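If switching regex engines is not an option, one possible workaround is to skip the regex and scan the UTF-8 bytes directly. The sketch below is not the \p{L} approach: it relies on the assumption that every byte of a non-ASCII code point (any byte >= 0x80) belongs to a word, and it mirrors the question's code in returning only the first token. The names process_utf8, is_word_byte and is_connector are made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// ASCII letters/digits, plus any byte of a multi-byte UTF-8 sequence.
// Assumption: every non-ASCII code point is treated as a word character.
static bool is_word_byte(unsigned char c) {
    return c >= 0x80 || std::isalnum(c) != 0;
}

// The joining characters allowed inside a token: ' _ -
static bool is_connector(unsigned char c) {
    return c == '\'' || c == '_' || c == '-';
}

// Hypothetical stand-in for process(): first token of s, ASCII letters lowercased.
std::string process_utf8(const std::string& s) {
    std::size_t i = 0;
    // Skip everything before the first word byte.
    while (i < s.size() && !is_word_byte(static_cast<unsigned char>(s[i]))) {
        ++i;
    }
    const std::size_t start = i;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (is_word_byte(c)) {
            ++i;
        } else if (is_connector(c) && i + 1 < s.size() &&
                   is_word_byte(static_cast<unsigned char>(s[i + 1]))) {
            i += 2;  // a connector only counts when a word byte follows it
        } else {
            break;
        }
    }
    std::string out = s.substr(start, i - start);
    for (char& ch : out) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // Only ASCII is lowercased; multi-byte sequences are left untouched.
            ch = static_cast<char>(std::tolower(c));
        }
    }
    return out;
}
```

With this sketch, process_utf8("ąžuolas!") keeps the leading ą and ž, because their bytes are all >= 0x80. Two limitations follow from the assumption above: non-ASCII punctuation (curly quotes, for instance) is kept as part of the word, and non-ASCII uppercase letters such as Ą are not lowercased. Handling those properly would need a Unicode-aware library.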


answered Oct 17 '22 16:10 by Anmol Singh Jaggi