Is there any way to get libc6's regexp functions regcomp and regexec to work properly with multi-byte characters?
For instance, if my pattern is the UTF-8 characters 猫机+猫, finding a match on the UTF-8 encoded string 猫机机机猫 will fail, where it should succeed.
I think this is because the character 机's byte representation is \xe6\x9c\xba, and the + is matching one or more of the byte \xba. I can make this instance work by putting parentheses around each multibyte character in the pattern, but since this is for an application I can't require users to do this.
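To make the behaviour concrete, here is a minimal C sketch. It assumes the program never calls setlocale, so it runs in the default "C" locale, where regcomp sees the pattern as a sequence of single bytes. The helper name ere_matches is made up for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` (POSIX extended syntax)
 * matches anywhere in `text`, 0 if it does not, and -1 if the pattern
 * fails to compile. No setlocale() call is made, so the default "C"
 * locale applies and every byte counts as one character. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under that assumption, ere_matches("猫机+猫", "猫机机机猫") returns 0, because the + only repeats the final byte \xba, while ere_matches("猫(机)+猫", "猫机机机猫") returns 1.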
Is there a way to flag a pattern or string to match as containing UTF-8 characters? Perhaps telling libc to store the pattern as wchar instead of char?
According to its manual page, glibc implements POSIX regular expressions. POSIX regexps have no Unicode support per se; see this answer for an excerpt of the standard that clarifies this point. This means you can also forget about UTF-8, and that whatever locale you're in, multi-byte characters won't work.
The post I've mentioned (as well as this one) suggests using a Unicode-aware regexp library such as PCRE. If you're interested, PCRE provides a POSIX-style wrapper interface, with the addition of a non-standard REG_UTF flag. You won't have to rewrite your code, apart from changing the #include directive and adding REG_UTF to the flags you pass to regcomp.
Hope this covers your needs.
Can you use a regex to build your regex? Here's a JavaScript example (though I know you aren't using JS):
function Examp() {
    // Wrap every character that precedes + or * in parentheses.
    // The u flag makes . match a whole code point rather than a UTF-16 code unit.
    var uString = "猫机+猫+猫ymg+sah猫";
    var plussed = uString.replace(/(.)(?=[+*])/gu, "($1)");
    console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);

    // Same idea, but keep a backslash-escaped character together with its backslash.
    uString = "猫机+猫*猫ymg+s\\a+I+h猫";
    plussed = uString.replace(/(\\?.)(?=[+*])/gu, "($1)");
    console.log("You can even take this a step further and account for a character being escaped, if that's a consideration.");
    console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);
}
<input type="button" value="Run" onclick="Examp()" />
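Since the question is about C, the same preprocessing idea can be sketched there too. This is only a rough sketch under stated assumptions: the pattern is well-formed UTF-8, only the quantifiers +, *, ? and { are considered, and bracket expressions and backslash escapes are not handled. The names utf8_len and wrap_multibyte_atoms are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence whose lead byte is `c`
 * (1 for ASCII and for any byte that is not a multibyte lead byte). */
static size_t utf8_len(unsigned char c)
{
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: returns a malloc'd copy of `pattern` in which each
 * multibyte character directly followed by a quantifier is wrapped in
 * parentheses. The caller frees the result. Returns NULL if malloc fails. */
char *wrap_multibyte_atoms(const char *pattern)
{
    size_t n = strlen(pattern);
    /* A wrapped character is at least 2 bytes long and gains 2 bytes,
     * so the output never exceeds 2 * n bytes plus the terminator. */
    char *out = malloc(2 * n + 1);
    if (out == NULL)
        return NULL;

    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = utf8_len((unsigned char)pattern[i]);
        if (len > n - i)
            len = n - i;               /* truncated sequence: copy what is left */
        char next = pattern[i + len];  /* '\0' at the end of the pattern */
        int wrap = len > 1 && next != '\0' && strchr("+*?{", next) != NULL;

        if (wrap)
            out[o++] = '(';
        memcpy(out + o, pattern + i, len);
        o += len;
        if (wrap)
            out[o++] = ')';
        i += len;
    }
    out[o] = '\0';
    return out;
}
```

For the pattern from the question, wrap_multibyte_atoms("猫机+猫") produces 猫(机)+猫, which can then be handed to regcomp with REG_EXTENDED. Note that the added parentheses are capturing groups, so any submatch indices the application relies on will shift.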