
Peculiar behaviour of '*' in a regular expression

While writing a regex pattern to substitute every run of continuous '1's, and every single '1', with '*', I ran into something confusing. Using '+' (which matches one or more) gave the expected result, but '*' gave a strange one:

>>> l='100'
>>> import re
>>> j=re.compile(r'(1)*')    
>>> m=j.sub('*',l)
>>> m
'*0*0*'

Using '+' gave the expected result:

>>> l='100'
>>> j=re.compile(r'1+')
>>> m=j.sub('*',l)
>>> m
'*00'

How does '*' produce this result, given that its behaviour is to match 0 or more?

asked Jun 01 '17 19:06 by dodo



1 Answer

(1)* means "match zero or more 1s". So for 100 it matches the 1, the empty string between the two 0s, and the empty string after the last 0. The substitution then replaces every one of those matches, the empty ones included, with '*'. 1+ requires at least one 1 in the match, so it can never match the empty string at a boundary between characters.
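To see where the matches actually land, a minimal sketch along these lines can help. It lists the span of every match for both patterns; the helper name `spans` is made up for illustration, and the exact set of empty matches reported for `(1)*` may differ between Python versions, so no full output is shown for it.

```python
import re

text = "100"

def spans(pattern, s):
    """Return (start, end, matched_text) for every match of pattern in s."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, s)]

plus_matches = spans(r"1+", text)
star_matches = spans(r"(1)*", text)

# '1+' needs at least one '1', so it only matches at the start of "100".
print(plus_matches)  # [(0, 1, '1')]

# '(1)*' also succeeds on zero characters, so it yields empty matches
# (start == end) at positions where there is no '1' at all.
print(star_matches)
empty_positions = [start for start, end, _ in star_matches if start == end]
print(empty_positions)
```

Every entry in `empty_positions` is a place where `sub` has something to replace even though no character was consumed, which is where the extra `*` characters come from.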

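For curious readers: yes, the Python output is *0*0* and not **0*0*, at least on the interpreters available when this was written. (Since Python 3.7, re.sub also replaces an empty match that is adjacent to a previous non-empty match, so newer versions return **0*0*.) Here is a test Python script to play with. (Regex101 has the wrong output for this because it does not use an actual Python regex engine. Online regex testers usually run PCRE, which is what PHP and Apache HTTP Server provide, and emulate the target engine on top of it. Always test your regex in live code!)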
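A quick way to check what your own interpreter does is sketched below. It assumes only the standard `re` and `sys` modules; the variable names are illustrative. The branch on the version reflects the documented Python 3.7 change to how `re.sub` treats an empty match directly after a non-empty one.

```python
import re
import sys

text = "100"

star_result = re.sub(r"(1)*", "*", text)
plus_result = re.sub(r"1+", "*", text)

# Before Python 3.7, re.sub skipped an empty match that sat directly
# after a non-empty match; from 3.7 on, that empty match is replaced too.
expected_star = "**0*0*" if sys.version_info >= (3, 7) else "*0*0*"

print("Python", ".".join(map(str, sys.version_info[:2])))
print("(1)* ->", star_result, "(expected here:", expected_star + ")")
print("1+   ->", plus_result)
```

The `1+` result is `*00` on every version, since that pattern never produces an empty match.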

Here you can see that in JavaScript the output is **0*0*, because the engine treats the empty string between 1 and 0 as a new match. This is a prime example of why regex flavor is important. Different engines use slightly different rules (in this case, whether a new empty match may begin right where the previous match ended, or only at the next character boundary).

console.log("100".replace(/(1)*/g, '*'))
answered Oct 09 '22 22:10 by Tezra