
How to handle PCRE with Unicode?

I'm using Visual Studio 2010 to do some regular expression matching via PCRE. Let's say I have a pattern and a subject, both given as std::wstring, like this:

std::wstring subject = L"サービス内容";
std::wstring pattern = L"ス内";

As you can see, I'm trying to locate Japanese strings, so I need a Unicode variant of PCRE, for example pcre16 or pcre32, with functions such as pcre16_exec or pcre32_exec.

Unfortunately, it does not work. My problem seems to be the conversion from wstring to unsigned short or unsigned int (depending on whether pcre16 or pcre32 is used). I tried a lot of functions (wcstombs_s, string conversions with QString, etc.) but without success. The result of the exec function never holds the values I expect. I'm not really sure what went wrong; pattern matching with ANSI strings using the plain pcre functions works fine. Here's a snippet:

pcre16 *re;
const char *error;
int erroffset;
int ovector[30]; // The result of the matching
int subject_length;
int rc;

std::wstring subjectstr = L"サービス内容";
std::wstring patternstr = L"ス内";
subject_length = 6;

const unsigned short pattern = ....// string conversion from patternstr
const unsigned short subject = ....// string conversion from subjectstr

re = pcre16_compile(&pattern, PCRE_UTF16, &error, &erroffset, NULL);
rc = pcre16_exec(re, NULL, &subject, subject_length, 0, 0, ovector, 30);

Can somebody please give me a working example of how to match Unicode patterns with PCRE, or explain what went wrong? I'm getting exasperated with myself.

MichaelXanadu asked Nov 12 '22 21:11


1 Answer

I found the solution here.

The key was a very simple cast from const wchar_t* to const unsigned short* (PCRE_SPTR16). This works because wchar_t is 16 bits wide under Visual Studio, so a std::wstring already holds UTF-16 code units. I kept trying more complicated conversions. In a nutshell, here's a working example for anybody who might be interested. The results of the pattern matching can be found in subStrVec:

pcre16 *reCompiled;
int pcreExecRet;
int subStrVec[30];
const char *pcreErrorStr;
int pcreErrorOffset;  

std::wstring pattern = L"容内容";
std::wstring subject = L"容容容内容容容";

const wchar_t* aStrRegex = pattern.c_str();
const wchar_t* line = subject.c_str();

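// The casts below are only valid where wchar_t is 16 bits wide (e.g. Visual Studio).
// PCRE_UTF16 is the UTF option that belongs to the 16-bit library.
reCompiled = pcre16_compile((PCRE_SPTR16)aStrRegex, PCRE_UTF16, &pcreErrorStr, &pcreErrorOffset, NULL);
// The subject length is given in 16-bit code units; a negative return value means no match or an error.
pcreExecRet = pcre16_exec(reCompiled, NULL, (PCRE_SPTR16)line, (int)wcslen(line), 0, 0, subStrVec, 30);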
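
The cast above relies on wchar_t being 16 bits wide, and the snippet does not show how to read the result. Here is a minimal sketch of both points, assuming a C++11 compiler (so newer than Visual Studio 2010). The helpers to_utf16 and matched_text are hypothetical names, not part of PCRE, and the code only touches the standard library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of PCRE): convert a std::wstring to UTF-16.
// With a 16-bit wchar_t the units are copied unchanged; with a 32-bit
// wchar_t, code points above U+FFFF are split into surrogate pairs.
std::u16string to_utf16(const std::wstring& in) {
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x10000UL) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000UL;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: return the text of capture group `group` (0 = whole
// match) from an offset vector such as subStrVec. `rc` is the value returned
// by pcre16_exec. Offsets count 16-bit code units; the end offset is exclusive.
std::u16string matched_text(const std::u16string& subject,
                            const int* ovector, int rc, int group = 0) {
    if (rc <= group) return std::u16string();  // no match, error, or group not reported
    int start = ovector[2 * group];
    int end = ovector[2 * group + 1];
    if (start < 0) return std::u16string();    // group did not take part in the match
    assert(start <= end && static_cast<std::size_t>(end) <= subject.size());
    return subject.substr(start, end - start);
}
```

With these, the subject would be passed as reinterpret_cast<PCRE_SPTR16>(s.c_str()) with length static_cast<int>(s.size()), where s is the result of to_utf16. That cast assumes char16_t and PCRE's 16-bit character type have the same size, which I have not verified for every platform. For the example above, the whole match would then be matched_text(s, subStrVec, pcreExecRet).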
MichaelXanadu answered Nov 15 '22 04:11