
Converting C++ Boost Regexes to Python re regexes [closed]

The aim is to convert these regexes in C++ boost to Python re regexes:

  typedef boost::u32regex tRegex;

  tRegex emptyre = boost::make_u32regex("^$");
  tRegex commentre = boost::make_u32regex("^;.*$");
  tRegex versionre = boost::make_u32regex("^@\\$Date: (.*) \\$$");
  tRegex includere = boost::make_u32regex("^<(\\S+)$");
  tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
  tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
  tRegex tokre = boost::make_u32regex("^:(.*)$");
  tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
  tRegex groupendre = boost::make_u32regex("^#$");
  tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");

I could rewrite these regexes one by one, but there are quite a lot more than the examples above, so my question is with regards to

  • how to convert C++ boost regexes to Python and
  • what is the difference between boost regexes and python re regexes?

Is the C++ boost::u32regex the same as re regexes in python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:

  • in boost, there's boost::u32regex_match, is that the same as re.match?
  • in boost, there's boost::u32regex_search, how is it different to re.search
  • there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
alvas asked Dec 02 '15 17:12




1 Answer

How to convert C++ boost regexes to Python

In the case of a simple regex, like \w+\s+\d+ or >.*$, you won't have to change the pattern. For more complex patterns using the constructs mentioned below, you will most probably have to rewrite the regex. As with any conversion from one flavor/language to another, the general answer is DON'T. However, Python and Boost do have some similarities, especially when it comes to simple patterns (if Boost is using a PCRE-like pattern syntax) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +/*/?, and suchlike.
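As a rough illustration of the "simple" case, the patterns from the question can be carried over nearly verbatim. This is only a sketch: the dictionary name and the sample lines are made up for illustration, and using fullmatch assumes the C++ code applied the patterns with u32regex_match. The only change to the patterns themselves is that each doubled C++ backslash becomes a single backslash inside a Python raw string.

```python
import re

# Patterns from the question, written as raw strings so that each
# doubled C++ backslash ("\\S") becomes a single one (r"\S").
PATTERNS = {
    "empty": re.compile(r"^$"),
    "comment": re.compile(r"^;.*$"),
    "version": re.compile(r"^@\$Date: (.*) \$$"),
    "include": re.compile(r"^<(\S+)$"),
    "rungroup": re.compile(r"^>(\d+)$"),
    "readrepp": re.compile(r"^>(\S+)$"),
    "tok": re.compile(r"^:(.*)$"),
    "groupstart": re.compile(r"^#(\d+)$"),
    "groupend": re.compile(r"^#$"),
    "rule": re.compile(r"^([!-+^])([^\t]+)\t+([^\t]*)$"),
}

# Hypothetical sample lines, not taken from any real input file.
version = PATTERNS["version"].fullmatch("@$Date: 2015-12-02 $").group(1)
rule = PATTERNS["rule"].fullmatch("+word\ttag").groups()
rungroup = PATTERNS["rungroup"].fullmatch(">12").group(1)
```

Compiling once with re.compile mirrors how the C++ code builds each tRegex object up front.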

what is the difference between boost regexes and python re regexes?

Python's re module is not as rich as Boost regex. It lacks constructs such as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, POSIX character classes, Unicode character properties, and the extended replacement pattern syntax. (Atomic groups and possessive quantifiers were also missing until Python 3.11.) Before Python 3.6, an inline modifier group such as (?imsx-imsx:pattern) could not be scoped to a part of the expression, so the (?i) in your &amp;|&#((?i)x26);|&#38; was treated as if it were at the beginning of the pattern (which does not have any impact on this particular expression). Python 3.6+ accepts scoped groups like (?i:x26), and since Python 3.11 a global flag such as (?i) anywhere but the start of the pattern is an error.
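Some of the missing constructs have a workaround on the Python side. For example, where a Boost pattern quotes literal text with \Q...\E, re.escape can build the escaped fragment instead. A small sketch (the literal text is a hypothetical example):

```python
import re

# Hypothetical literal that is full of regex metacharacters.
literal = "1+1=2 (approx.)"

# Boost/Perl would allow \Q1+1=2 (approx.)\E; in Python the literal
# part is escaped with re.escape and concatenated into the pattern.
pattern = re.compile(re.escape(literal) + r"\s*$")

found = pattern.search("so 1+1=2 (approx.)  ") is not None
not_found = pattern.search("so 111=2 (approx.)") is None
```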

Also, as in Boost, you do not have to escape [ inside a character class or { outside a character class.

Backreferences like \1 work the same way in Python as in Boost.

Since you are not using capturing groups in alternation in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', r'\1', 'goon')), there should be no problem. In such cases a group that did not participate in the match has no value: Python 3.5+ substitutes an empty string for it, while earlier versions raise an error.
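To make the non-participating group case concrete, here is a short sketch, assuming Python 3.5 or later (older versions raise an "unmatched group" error for the first call). The callable replacement is one possible way to make the empty-string fallback explicit.

```python
import re

pattern = r"\d(\w)|(go\w*)"

# "goon" is matched by the second alternative, so group 1 takes no
# part in the match; Python 3.5+ replaces \1 with an empty string.
result = re.sub(pattern, r"\1", "goon")

# A callable replacement makes the fallback explicit.
explicit = re.sub(pattern, lambda m: m.group(1) or "", "1a goon")
```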

Note the difference in named group definition: (?<NAME>expression)/(?'NAME'expression) in Boost, and (?P<NAME>expression) in Python.
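A minimal sketch of the Python spelling (the group names and the sample text are hypothetical):

```python
import re

# Boost: (?<year>\d{4})-(?<month>\d{2})
# Python: the same groups need the (?P<name>...) syntax.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

m = date_re.search("released 2015-12")
year = m.group("year")
parts = m.groupdict()
```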

I see your regexps mainly fall under the "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize such patterns, you could use the unroll-the-loop technique, but it may not be necessary, depending on the size of the texts you handle with the expressions.
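For reference, this is roughly what unrolling that tempered greedy token could look like. The sample text is made up, and the two patterns are not strictly identical: the dot in the tempered version does not match a newline by default, while the negated class [^-] in the unrolled version does.

```python
import re

text = "x ⌊-abc-def-⌋ y"  # hypothetical sample input

# Tempered greedy token: check the lookahead before every character.
tempered = re.compile(r"⌊-((?:(?!-⌋).)*)-⌋")

# Unrolled: consume runs of non-hyphens, and allow a hyphen only
# when it is not followed by the closing ⌋.
unrolled = re.compile(r"⌊-([^-]*(?:-(?!⌋)[^-]*)*)-⌋")

tempered_hit = tempered.search(text).group(1)
unrolled_hit = unrolled.search(text).group(1)
```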

The most troublesome part, as I see it, is that you are using Unicode literals heavily. In Python 2.x, plain string literals are byte strings, and you will always have to make sure you pass a unicode object to the Unicode regexps (see Python 2.x’s Unicode Support). In Python 3, all strings are Unicode by default, source files are read as UTF-8 by default, and you can use non-ASCII literal characters in source code without any additional actions (see Python’s Unicode Support). So Python 3.3+ is a good candidate; write the patterns as raw string literals (r'...') so that backslashes do not need doubling.
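A short sketch of what this means in Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag switches back to ASCII-only classes (the sample string is just an illustration):

```python
import re

name = "Wiktor Stribiżew"

# With a str pattern, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", name)

# re.ASCII restricts \w to [a-zA-Z0-9_], so "ż" splits the word.
ascii_words = re.findall(r"\w+", name, flags=re.ASCII)
```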

Now, as for the remaining questions:

in boost, there's boost::u32regex_match, is that the same as re.match?

re.match is not the same as regex_match: re.match only looks for a match at the beginning of the string, whereas regex_match requires the whole string to match. However, in Python 3.4+, you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost regex_match.
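The difference is easy to see on a string that only starts with a match. The last line is one possible emulation for Python versions without re.fullmatch, shown here as an assumption rather than an official recipe:

```python
import re

pattern = r"\d+"

# re.match is anchored at the start only, so a prefix is enough.
prefix = re.match(pattern, "123abc")

# re.fullmatch (Python 3.4+) requires the whole string to match.
whole = re.fullmatch(pattern, "123abc")
exact = re.fullmatch(pattern, "123")

# Possible emulation without re.fullmatch: wrap the pattern in a
# non-capturing group and anchor it with \Z.
emulated = re.match(r"(?:%s)\Z" % pattern, "123abc")
```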

in boost, there's boost::u32regex_search, how is it different to re.search

Whenever you need to find a match anywhere inside a string, use re.search (see match() versus search()). This method provides functionality analogous to regex_search in Boost.
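A small sketch with a made-up input string:

```python
import re

text = "abc 123 def"

# re.search scans the whole string for the first match.
hit = re.search(r"\d+", text)
value = hit.group()
span = hit.span()

# re.match fails here because the string does not start with a digit.
anchored = re.match(r"\d+", text)
```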

there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?

Python does not support Perl-like expressions to the extent Boost does: the re module is a "trimmed" Perl-style regex engine that lacks many of the nice features I mentioned earlier. Thus, there are no flags like match_default or format_perl. As for smatch, the closest counterpart is the match object returned by re.search, re.match, or re.fullmatch; you can use re.finditer to get match objects for all matches. re.findall returns all matches as a list of strings or, if capturing groups are specified, only the captured substrings (as tuples when there is more than one group). See the re.findall/re.finditer difference.

In conclusion, a must-read article: Python’s re Module.
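The difference between the two functions can be sketched as follows (the input string is hypothetical):

```python
import re

text = "a=1, b=22"

# No capturing groups: a list of whole matches.
whole = re.findall(r"\w=\d+", text)

# Two capturing groups: a list of tuples of the captured parts.
pairs = re.findall(r"(\w)=(\d+)", text)

# finditer yields match objects, which also expose positions,
# roughly what boost::smatch gives access to for a single match.
matches = list(re.finditer(r"(\w)=(\d+)", text))
starts = [m.start() for m in matches]
firsts = [m.group(0) for m in matches]
```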

Wiktor Stribiżew answered Oct 12 '22 10:10