
Does POSIX regex.h support Unicode, or more generally non-ASCII characters?

Tags: c++, c, linux, posix

Hi, I am using the standard regex library (regcomp, regexec, ...). I now need to add Unicode support to the regular expressions in my code.

Does the standard regex library support Unicode, or more generally non-ASCII characters? From what I found on the web, I think it does not.

My project is resource-critical, so I don't want to pull in a large library for this (ICU or Boost.Regex).

Any help would be appreciated.

iyasar asked Jan 04 '12 13:01


2 Answers

Basically, POSIX regexes are not Unicode-aware. You can use them on text that contains Unicode characters, but you may run into problems with glyphs that have more than one encoding, and with other issues that Unicode-aware libraries handle for you.

From the standard, IEEE Std 1003.1-2008:

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
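To make the quoted passage concrete, here is a minimal sketch of the "two encodings for one graphic symbol" case, using "é" written once as the precomposed code point U+00E9 and once as "e" plus the combining accent U+0301. The helper posix_matches() and the two constants are hypothetical names introduced only for this illustration; the byte sequences are the standard UTF-8 encodings.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` (a POSIX basic RE) matches
 * anywhere in `text`, and 0 otherwise (including on a compile error). */
int posix_matches(const char *pattern, const char *text) {
  regex_t re;
  int rc;

  assert(pattern != NULL && text != NULL);
  if (regcomp(&re, pattern, REG_NOSUB) != 0) return 0;
  rc = regexec(&re, text, 0, NULL, 0);
  regfree(&re);
  return rc == 0;
}

/* "é" as a single precomposed code point, U+00E9 (UTF-8 bytes: C3 A9). */
const char E_ACUTE_PRECOMPOSED[] = "\xC3\xA9";

/* "é" as "e" followed by combining acute accent U+0301 (UTF-8: 65 CC 81). */
const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";
```

With these definitions, posix_matches(E_ACUTE_PRECOMPOSED, "caf\xC3\xA9") finds a match, while posix_matches(E_ACUTE_PRECOMPOSED, "cafe\xCC\x81") does not, even though both strings render as "café". Handling that would require normalizing the input first or spelling out both forms in the pattern, which is the kind of work a Unicode-aware library does for you.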

Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would expect it to be lighter than ICU or Boost.

cha0site answered Sep 22 '22 21:09


It looks like POSIX regex works properly in a UTF-8 locale. I've just written a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything works fine.

Note: the main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.
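To see the effect of that note in isolation, separately from the full test program that follows, here is a minimal sketch that runs the same pattern under a named locale. The helper match_in_locale() is a hypothetical name, and the locale name "C.UTF-8" used in the usage note is an assumption; it may not be installed on every system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switch LC_ALL to `locale_name`, then test whether
 * `pattern` (a POSIX basic RE) matches anywhere in `text`.
 * Returns 1 on a match, 0 on no match, and -1 if the locale is not
 * available or the pattern does not compile.
 * Note: this changes the process-wide locale and does not restore it. */
int match_in_locale(const char *locale_name, const char *pattern,
                    const char *text) {
  regex_t re;
  int rc;

  assert(locale_name != NULL && pattern != NULL && text != NULL);
  if (setlocale(LC_ALL, locale_name) == NULL) return -1;
  /* The locale must be in place before regcomp() is called. */
  if (regcomp(&re, pattern, REG_NOSUB) != 0) return -1;
  rc = regexec(&re, text, 0, NULL, 0);
  regfree(&re);
  return rc == 0 ? 1 : 0;
}
```

For example, match_in_locale("C", "[[:alpha:]]", "\xd1\x84") asks whether the UTF-8 bytes of "ф" count as alphabetic in the default "C" locale, and match_in_locale("C.UTF-8", "[[:alpha:]]", "\xd1\x84") asks the same in a UTF-8 locale. I would expect no match in the first case and a match in the second wherever that locale exists; a return value of -1 means the locale was not found.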

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

#define MAX_MATCHES 10

int main(int argc, char** argv) {
  int ret;
  size_t i;
  regex_t reg;
  regmatch_t matches[MAX_MATCHES];

  if (argc != 3) {
    fprintf(stderr, "Usage: %s regex string\n", argv[0]);
    return 1;
  }

  /* Use the locale from the environment instead of the default "C".
     This has to happen before regcomp(). */
  if (setlocale(LC_ALL, "") == NULL)
    fprintf(stderr, "warning: cannot set locale from the environment\n");

  if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
    char buf[256];
    regerror(ret, &reg, buf, sizeof(buf));
    fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
    return 1;
  }

  if ((ret = regexec(&reg, argv[2], MAX_MATCHES, matches, 0)) == 0) {
    /* matches[0] is the whole match, the rest are subexpressions.
       rm_so and rm_eo are byte offsets, not character offsets. */
    for (i = 0; i < MAX_MATCHES; i++) {
      int start, len;
      if (matches[i].rm_so == -1) continue; /* unused or non-participating */
      /* regoff_t is not necessarily int, so convert before printing. */
      start = (int)matches[i].rm_so;
      len = (int)(matches[i].rm_eo - matches[i].rm_so);
      /* %.*s prints exactly len bytes, so no temporary buffer is needed. */
      printf("%zu: %d-%d: '%.*s'\n", i, start, start + len, len,
             argv[2] + start);
    }
  }

  regfree(&reg); /* release the memory held by the compiled pattern */
  return 0;
}

Usage example:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$

The match is two bytes long (offsets 5-7) because each Cyrillic letter takes two bytes in UTF-8; the offsets that regexec() reports are byte offsets, not character offsets.
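If character positions are needed rather than byte positions, the reported offsets can be converted by counting UTF-8 lead bytes. The following is a minimal sketch under two assumptions: the input is valid UTF-8, and the offset falls on a character boundary (as it does in the sample output above). The function name utf8_chars_before() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the UTF-8 characters (code points) contained
 * in the first `byte_off` bytes of `s`. Stops early at the terminating
 * NUL if `byte_off` is larger than the string. */
size_t utf8_chars_before(const char *s, size_t byte_off) {
  size_t i, count = 0;

  assert(s != NULL);
  for (i = 0; i < byte_off && s[i] != '\0'; i++) {
    /* Continuation bytes have the form 10xxxxxx; every other byte
       starts a new character. */
    if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
  }
  return count;
}
```

For the string from the usage example, ' 359 фыва', byte offsets 5 and 7 correspond to character offsets 5 and 6, so the match covers exactly one character.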

praetorian droid answered Sep 24 '22 21:09