
multi-byte characters in libc regcomp and regexec

Is there any way to get libc6's regexp functions regcomp and regexec to work properly with multi-byte characters?

For instance, if my pattern is the UTF-8 characters 猫机+猫, matching it against the UTF-8 encoded string 猫机机机猫 fails, where it should succeed.

I think this is because the byte representation of the character 机 is \xe6\x9c\xba, so the + matches one or more of the byte \xba rather than one or more of the whole character. I can make this instance work by putting parentheses around each multi-byte character in the pattern, but since this is for an application I can't require users to do this.
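To make the byte-level behaviour concrete, here is a minimal C sketch. The helper names hex_escape and ere_matches are made up for illustration; only regcomp, regexec and regfree come from libc. It assumes the strings are UTF-8 encoded and that the pattern is compiled with REG_EXTENDED, so that + acts as a quantifier.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as "\xNN" escapes,
   which makes the UTF-8 encoding of a character visible. */
void hex_escape(const char *s, char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return;
    out[0] = '\0';
    /* Each escape needs 4 characters plus the terminating NUL. */
    for (; *s != '\0' && used + 5 <= outsz; s++, used += 4)
        snprintf(out + used, outsz - used, "\\x%02x", (unsigned char)*s);
}

/* Hypothetical helper: return 1 if text matches the POSIX extended regex
   pattern, 0 if it does not, and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling hex_escape on 机 shows the three bytes quoted above. In a program that never calls setlocale, ere_matches("猫机+猫", "猫机机机猫") should report no match on glibc, while the parenthesized pattern 猫(机)+猫 should match the same string.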

Is there a way to flag a pattern or string to match as containing utf8 characters? Perhaps telling libc to store the pattern as wchar instead of char?

bill_e asked Jan 23 '15 17:01


2 Answers

According to its manual page, glibc implements POSIX regexps. POSIX regexps have no Unicode support per se; see this answer for an excerpt of the standard that sheds light on this point. This means there is no flag that tells regcomp to treat a pattern as UTF-8. The functions work on characters as the current locale defines them, and in the default "C" locale, which a program runs in until it calls setlocale, every byte counts as one character, so multi-byte characters won't fit.

The post I've mentioned (as well as this one) suggests using a Unicode-aware regexp library such as PCRE. If you're interested, PCRE provides a POSIX-style wrapper interface, with the addition of a non-standard REG_UTF flag. You won't have to rewrite your code, apart from changing the #include directive and adding REG_UTF to the flags you pass when compiling the pattern.

Hope this covers your needs.

Champignac answered Nov 15 '22 04:11


Can you use a regex to build your regex? Here's a javascript example, (though I know you aren't using js):

function Examp () {
  // Wrap the character in front of each + or * in parentheses.
  // The u flag makes . match a whole code point rather than a UTF-16 code unit.
  var uString = "猫机+猫+猫ymg+sah猫";
  var plussed = uString.replace(/(.)(?=[+*])/gu, "($1)");
  console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);
  // An optional leading backslash keeps an escaped character together with its escape.
  uString = "猫机+猫*猫ymg+s\\a+I+h猫";
  plussed = uString.replace(/(\\?.)(?=[+*])/gu, "($1)");
  console.log("You can even take this a step further and account for a character being escaped, if that's a consideration.");
  console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);
}
<input type="button" value="Run" onclick="Examp()" />
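Since the question is about C, here is a rough sketch of the same idea in C. The names utf8_len and group_multibyte are hypothetical. Unlike the JavaScript above, it only wraps multi-byte characters, because single-byte characters already work with a quantifier. It assumes valid UTF-8 input and does not handle bracket expressions such as [猫机].

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte c.
   Continuation or malformed lead bytes are treated as length 1. */
size_t utf8_len(unsigned char c)
{
    if (c >= 0xF0) return 4;
    if (c >= 0xE0) return 3;
    if (c >= 0xC0) return 2;
    return 1;
}

/* Copy pat to out, putting parentheses around every multi-byte character
   (optionally preceded by a backslash) that is directly followed by one
   of the quantifiers + * ? {.
   Returns 0 on success, -1 if out is too small. */
int group_multibyte(const char *pat, char *out, size_t outsz)
{
    size_t o = 0;

    while (*pat != '\0') {
        int escaped = (pat[0] == '\\' && pat[1] != '\0');
        size_t clen = utf8_len((unsigned char)pat[escaped ? 1 : 0]);
        size_t n = clen + (escaped ? 1 : 0);
        size_t left = strlen(pat);
        int wrap;

        if (n > left)           /* truncated sequence: stay inside the string */
            n = left;
        wrap = clen > 1 && pat[n] != '\0' && strchr("+*?{", pat[n]) != NULL;

        /* Room for the unit, the optional parentheses and the final NUL. */
        if (o + n + (wrap ? 2 : 0) + 1 > outsz)
            return -1;
        if (wrap)
            out[o++] = '(';
        memcpy(out + o, pat, n);
        o += n;
        if (wrap)
            out[o++] = ')';
        pat += n;
    }
    if (o + 1 > outsz)          /* empty pattern with a zero-sized buffer */
        return -1;
    out[o] = '\0';
    return 0;
}
```

The rewritten pattern can then be passed to regcomp with REG_EXTENDED. One side effect to keep in mind: POSIX extended regexps have no non-capturing groups, so the added parentheses shift the numbering of any submatches the user's own pattern defines.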
Regular Jo answered Nov 15 '22 05:11