Hi i am using Standard Regex Library (regcomp, regexec..). But now on demand i should add unicode support to my codes for regular expressions.
Does Standard Regex Library provide unicode or basically non-ascii characters? I researched on the Web, and think not.
My project is resource critic therefore i don't want to use large libraries for it (ICU and Boost.Regex).
Any help would be appreciated..
Of the regex flavors discussed in this tutorial, Java, XML and . NET use Unicode-based regex engines. Perl supports Unicode starting with version 5.6. PCRE can optionally be compiled with Unicode support.
POSIX bracket expressions are a special kind of character classes. POSIX bracket expressions match one character out of a set of characters, just like regular character classes. They use the same syntax with square brackets.
To identify the Non Unicode characters we can use either Google Chrome or Mozilla firefox browser by just dragging and dropping the file to the browser. Chrome will show us only the row and column number of the .
Basically, POSIX regexes are not Unicode aware. You can try to use them on Unicode characters, but there might be problems with glyphs that have multiple encodings and other such issues that Unicode aware libraries handle for you.
From the standard, IEEE Std 1003.1-2008:
Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would think it lighter than ICU or Boost.
Looks like POSIX Regex working properly with UTF-8 locale. I've just wrote a simple test (see below) and used it for matching string with a cyrillic characters against regex "[[:alpha:]]"
(for example). And everything working just fine.
Note: The main thing you must remember - regex functions are locale-related. So you must call setlocale()
before it.
#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>
int main(int argc, char** argv) {
int ret;
regex_t reg;
regmatch_t matches[10];
if (argc != 3) {
fprintf(stderr, "Usage: %s regex string\n", argv[0]);
return 1;
}
setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */
if ((ret = regcomp(®, argv[1], 0)) != 0) {
char buf[256];
regerror(ret, ®, buf, sizeof(buf));
fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
return 1;
}
if ((ret = regexec(®, argv[2], 10, matches, 0)) == 0) {
int i;
char buf[256];
int size;
for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
if (matches[i].rm_so == -1) break;
size = matches[i].rm_eo - matches[i].rm_so;
if (size >= sizeof(buf)) {
fprintf(stderr, "match (%d-%d) is too long (%d)\n",
matches[i].rm_so, matches[i].rm_eo, size);
continue;
}
buf[size] = '\0';
printf("%d: %d-%d: '%s'\n", i, matches[i].rm_so, matches[i].rm_eo,
strncpy(buf, argv[2] + matches[i].rm_so, size));
}
}
return 0;
}
Usage example:
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$
The length of the matching result is two bytes because cyrillic letters in UTF-8 takes so much.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With