The aim is to convert these regexes from C++ boost to Python re regexes:
typedef boost::u32regex tRegex;
tRegex emptyre = boost::make_u32regex("^$");
tRegex commentre = boost::make_u32regex("^;.*$");
tRegex versionre = boost::make_u32regex("^@\\$Date: (.*) \\$$");
tRegex includere = boost::make_u32regex("^<(\\S+)$");
tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
tRegex tokre = boost::make_u32regex("^:(.*)$");
tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
tRegex groupendre = boost::make_u32regex("^#$");
tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");
I could rewrite these regexes one by one, but there are quite a lot more than in the example above, so my question is with regard to the following:
Is the C++ boost::u32regex the same as re regexes in Python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:
boost::u32regex_match, is that the same as re.match?
boost::u32regex_search, how is it different from re.search?
boost::format_perl, boost::match_default and boost::smatch, what are their equivalents in Python re?
To replace text in a string in Python, use the re.sub() function. It belongs to the standard-library re module (import re first). It searches the string for the pattern and returns a new string in which the matches are replaced by the given replacement.
Regular expressions are supported by many programming languages, such as Python, Java and JavaScript.
The re.search() function scans the string for the regular expression pattern and returns a match object for the first occurrence. Unlike re.match(), which only matches at the beginning of the string, it tries every position in the input. If the pattern is not found, None is returned.
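A minimal sketch of the two calls described above. The sample strings are made up for illustration; only re.sub and re.search from the standard library are used.

```python
import re

# Sample input invented for illustration.
text = "tokens   separated\tby  whitespace"

# re.sub returns a new string; the original string is left untouched.
collapsed = re.sub(r"\s+", " ", text)

# re.search returns a match object for the first occurrence, or None.
first_digits = re.search(r"\d+", "order 66, item 7")
no_digits = re.search(r"\d+", "no numbers here")

print(collapsed)
print(first_digits.group(0))
print(no_digits)
```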
How to convert C++ boost regexes to Python
In the case of a simple regex, like \w+\s+\d+ or >.*$, you won't have to change the pattern. In the case of more complex patterns with the constructs mentioned below, you will most probably have to rewrite the regex. As with any conversion from one flavor/language to another, the general answer is DON'T. However, Python and Boost do have some similarities, especially when it comes to simple patterns (if Boost is using a PCRE-like syntax) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +, * and ?, and suchlike.
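As a sketch of how the simple patterns from the question could carry over, the snippet below writes them as Python raw strings, so the doubled backslashes of the C++ literals become single ones. The PATTERNS table and the classify helper are hypothetical names introduced here for illustration, and re.fullmatch assumes Python 3.4 or later.

```python
import re

# Same patterns as in the question; raw strings avoid the C++ "\\" doubling.
PATTERNS = {
    "emptyre": r"^$",
    "commentre": r"^;.*$",
    "versionre": r"^@\$Date: (.*) \$$",
    "includere": r"^<(\S+)$",
    "rungroupre": r"^>(\d+)$",
    "readreppre": r"^>(\S+)$",
    "tokre": r"^:(.*)$",
    "groupstartre": r"^#(\d+)$",
    "groupendre": r"^#$",
    "rulere": r"^([!-+^])([^\t]+)\t+([^\t]*)$",
}
COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}


def classify(line):
    """Hypothetical helper: return (name, groups) of the first pattern
    that matches the whole line, or (None, ()) if none does."""
    for name, rx in COMPILED.items():
        m = rx.fullmatch(line)
        if m:
            return name, m.groups()
    return None, ()


print(classify("@$Date: 2016-01-01 $"))
print(classify("+lhs\trhs"))
```

The patterns are tried in the order they are listed, so ">12" is reported as rungroupre before readreppre gets a chance, which may or may not be the order the original C++ code uses.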
What is the difference between Boost regexes and Python re regexes?
The Python re module is not as rich as Boost regex: suffice it to mention such constructs as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, possessive quantifiers (added to re only in Python 3.11), POSIX character classes and character properties, the extended replacement pattern, and other features that Boost has. Before Python 3.6, an inline modifier such as (?i) always applied to the whole expression, not to a part of it, so the (?i) in your &|&#((?i)x26);|& is treated as if it were at the beginning of the pattern (it has no impact on this particular expression). Python 3.6 added the scoped form (?imsx-imsx:pattern), and since Python 3.11 a global flag that is not at the start of the pattern is an error, so move it to the front when porting.
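A short sketch of the two ways to apply an inline modifier in Python, assuming Python 3.6 or later for the scoped form. The patterns are made up for illustration.

```python
import re

# Scoped modifier (Python 3.6+): only the "b" is case-insensitive.
scoped = re.compile(r"a(?i:b)c")

# Whole-pattern flag, written at the very start of the pattern.
global_flag = re.compile(r"(?i)abc")

print(bool(scoped.fullmatch("aBc")))
print(bool(scoped.fullmatch("ABc")))
print(bool(global_flag.fullmatch("ABc")))
```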
Also, same as in Boost, you do not have to escape [ inside a character class, and { outside a character class.
The backreferences like \1 are the same as in Python.
Since you are not using capturing groups in alternation in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', r'\2', 'goon')), there should be no problem. In such cases, when the match comes from the other alternative, Python does not fill in the non-participating group with any value: since Python 3.5 the backreference is replaced with an empty string, and earlier versions raise an error.
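To make the non-participating group case concrete, here is a small sketch that reuses the pattern from the paragraph above, assuming Python 3.5 or later. The input string is made up so that both alternatives get a chance to match.

```python
import re

pattern = r"\d(\w)|(go\w*)"

# "1a" is matched by the first alternative, so group 2 does not participate
# and r"\2" becomes an empty string; "goon" is matched by the second
# alternative, so r"\2" is the whole match.
result = re.sub(pattern, r"\2", "1a goon")

print(repr(result))
```

Note the raw string for the replacement: a plain '\2' in Python source is the control character with code 2, not a backreference.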
Note the difference in named group definition: (?<NAME>expression) or (?'NAME'expression) in Boost, and (?P<NAME>expression) in Python.
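A sketch of what the named-group difference means when porting. The helper to_python_named_groups is hypothetical and deliberately naive: it only rewrites the (?<NAME> form, leaves the lookbehinds (?<= and (?<! alone, and does not try to handle the (?'NAME' form, escaped parentheses or character classes.

```python
import re


def to_python_named_groups(boost_pattern):
    """Hypothetical, naive helper: rewrite (?<NAME>...) as (?P<NAME>...).

    The negative lookahead keeps lookbehinds, (?<=...) and (?<!...), intact.
    """
    return re.sub(r"\(\?<(?![=!])", "(?P<", boost_pattern)


boost_style = r"(?<key>\w+)=(?<val>(?<!x)\d+)"
python_style = to_python_named_groups(boost_style)

m = re.fullmatch(python_style, "a=1")
print(python_style)
print(m.group("key"), m.group("val"))
```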
I see your regexps mainly fall under the "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize such patterns, you could use the unroll-the-loop technique, but it may not be necessary, depending on the size of the texts you handle with the expressions.
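For illustration, the tempered greedy token above can be written next to one possible unrolled form. The unrolled pattern is my own rewrite, not one taken from the question, and the sample text is invented. re.DOTALL is added to the tempered version so that both variants also agree on newlines.

```python
import re

# Tempered greedy token: consume any character that does not start "-⌋".
tempered = re.compile(r"⌊-((?:(?!-⌋).)*)-⌋", re.DOTALL)

# Unrolled loop: runs of non-hyphens, then a hyphen only if it is not
# followed by the closing bracket, repeated.
unrolled = re.compile(r"⌊-([^-]*(?:-(?!⌋)[^-]*)*)-⌋")

text = "x ⌊-abc-def-⌋ y ⌊-gh-⌋"

print(tempered.findall(text))
print(unrolled.findall(text))
```

The unrolled form does less work per character because the lookahead runs only at hyphens, which is where the speed-up on long inputs would come from.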
The most troublesome part, as I see it, is that you are using Unicode literals heavily. In Python 2.x, plain string literals are byte strings, and you always have to make sure you pass a unicode object to Unicode regexps (see Python 2.x’s Unicode Support). In Python 3, all strings are Unicode, source files are UTF-8 by default, and you can use non-ASCII literal characters in source code without any additional actions (see Python’s Unicode Support). So Python 3.3+, where raw string literals are Unicode strings as well, is a good candidate.
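A small sketch of what Unicode-aware matching looks like in Python 3, where str patterns are Unicode by default and re.ASCII switches the shorthand classes back to ASCII-only. The sample strings are made up.

```python
import re

word = re.compile(r"\w+")                  # Unicode-aware for str patterns
ascii_word = re.compile(r"\w+", re.ASCII)  # restricts \w to [a-zA-Z0-9_]

sample = "na\u00efve"                      # "naïve", written with an escape

# Arabic-Indic digits three and four; \d matches any Unicode decimal digit.
arabic_digits = "\u0663\u0664"

print(bool(word.fullmatch(sample)))
print(bool(ascii_word.fullmatch(sample)))
print(bool(re.fullmatch(r"\d+", arabic_digits)))
```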
Now, as for the remaining questions:
In Boost, there's boost::u32regex_match, is that the same as re.match?
re.match is not the same as regex_match: re.match looks for a match at the beginning of the string only, while regex_match requires a full string match. However, in Python 3.4+ you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost regex_match.
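The difference can be seen on a made-up input where the pattern matches only a prefix of the string (re.fullmatch assumes Python 3.4 or later):

```python
import re

line = "123abc"

prefix = re.match(r"\d+", line)      # anchored at the start only
whole = re.fullmatch(r"\d+", line)   # must consume the entire string

print(prefix.group(0))
print(whole)
```

On older Python versions, a common workaround is to append \Z to the pattern and keep using re.match.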
In Boost, there's boost::u32regex_search, how is it different from re.search?
Whenever you need to find a match anywhere inside a string, use re.search (see match() versus search()). This function provides functionality analogous to that of regex_search in Boost.
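A minimal illustration on an invented string where the match is not at the start:

```python
import re

line = "abc 123"

found = re.search(r"\d+", line)       # scans the whole string
at_start = re.match(r"\d+", line)     # only tries position 0

print(found.group(0), found.start())
print(at_start)
```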
There's also boost::format_perl, boost::match_default and boost::smatch; what are their equivalents in Python re?
Python does not support Perl-like expressions to the extent Boost can: the Python re module is just a "trimmed" Perl regex engine that lacks many of the nice features I mentioned earlier. Thus, no flags like default or perl can be found there. As for smatch, its role is played by the match objects that re.search and re.match return, and you can use re.finditer to get all the match objects. re.findall returns all matches (or only the submatches, if capturing groups are specified) as a list of strings or a list of tuples. See the re.findall/re.finditer difference.
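The following sketch shows the three result shapes on a made-up string: findall without groups, findall with one and with several groups, and finditer with full match objects.

```python
import re

text = "#1 #22 #333"

# No capturing group: list of whole matches.
whole = re.findall(r"#\d+", text)

# One capturing group: list of that group's text only.
one_group = re.findall(r"#(\d+)", text)

# Several capturing groups: list of tuples.
two_groups = re.findall(r"(#)(\d+)", text)

# finditer yields match objects, which also carry positions.
spans = [(m.group(0), m.start()) for m in re.finditer(r"#(\d+)", text)]

print(whole)
print(one_group)
print(two_groups)
print(spans)
```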
And in conclusion, a must-read article: Python’s re Module.