
Python regular expression pattern * is not working as expected

Tags:

python

regex

While working through Google's 2010 Python class, I found the following documentation:

'*' -- 0 or more occurrences of the pattern to its left

But when I tried the following

re.search(r'i*','biiiiiiiiiiiiiig').group() 

I expected 'iiiiiiiiiiiiii' as output but got ''. Why?

user1423015 asked Jan 10 '15 13:01



1 Answer

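* means 0 or more occurrences, and re.search returns only the first match it finds while scanning the string from left to right. At the very start of the string, before the b, i* already succeeds by matching zero i's. So the first match is the empty string, and that is what you get as output.

Change * to +, which requires at least one i, to get the desired output.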

>>> import re
>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i+','biiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'
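To see where each match is found, it can help to print the span as well as the matched text. This is only a small sketch; the shortened string 'biiiig' and the names text, star and plus are my own choices for illustration.

```python
import re

# Shortened stand-in for the string in the question (an assumption for brevity).
text = 'biiiig'

star = re.search(r'i*', text)  # zero or more i's
plus = re.search(r'i+', text)  # one or more i's

# i* succeeds immediately at index 0 with a zero-length match.
print(star.span(), repr(star.group()))  # (0, 0) ''

# i+ cannot match at index 0, so the search moves on to index 1.
print(plus.span(), repr(plus.group()))  # (1, 5) 'iiii'
```

The span (0, 0) shows that the i* match starts and ends before the b, which is why group() is empty.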

Consider this example.

>>> re.search(r'i*','biiiiiiiiiiiiiig').group()
''
>>> re.search(r'i*','iiiiiiiiiiiiiig').group()
'iiiiiiiiiiiiii'

In the second call, i* returns iiiiiiiiiiiiii because the regex engine first tries to match at the very start of the string. It finds an i there, so i* greedily consumes every consecutive i. When the string does not start with i (as in biiiiiiiiiiiiiig), i* still succeeds at the start, but only by matching zero characters: it matches the empty string that sits before the b. In general, i* matches the empty string in front of every character it cannot consume, which here means before b and before g. Because re.search returns only the first match, you get the empty string found before the b.
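The scanning behaviour described above can be imitated by hand. The following is a rough sketch, not how the re module is actually implemented; first_match_by_scanning is a hypothetical helper name, and the short input strings are made up for the example. It relies on the pos argument of a compiled pattern's match() method to try each start position in turn.

```python
import re

pattern = re.compile(r'i*')

def first_match_by_scanning(text):
    """Try each start position from left to right; return the first success."""
    for pos in range(len(text) + 1):
        m = pattern.match(text, pos)  # attempt a match starting exactly at pos
        if m is not None:
            return pos, m.group()
    return None

print(first_match_by_scanning('biiig'))  # (0, '')
print(first_match_by_scanning('iiig'))   # (0, 'iii')
```

In both cases the loop stops at position 0. The only difference is how many i's are available there: none in the first string, three in the second.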

Why do I get three empty strings as output in the example below?

>>> re.findall(r'i*','biiiiiiiiiiiiiig')
['', 'iiiiiiiiiiiiii', '', '']

As explained earlier, i* produces an empty match at every position where it cannot consume an i. The regex engine scans the input from left to right:

  1. The first empty string appears because i* cannot match the character b, but it does match the empty string that sits before the b.

  2. The engine then moves to the next character, an i, which i* does match, so it greedily consumes all the following i's. That gives iiiiiiiiiiiiii as the second result.

  3. After matching all the i's, the engine reaches g, which i* cannot match. So i* matches the empty string before g. That is the third result and the second empty string.

  4. Finally, i* matches the empty string at the end of the input, after the g. That is the fourth result and the third empty string.
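The four steps can be checked with re.finditer, which exposes the position of every match. This is a minimal sketch on a shortened string ('biiig' is my own stand-in for the original input, and matches is just an illustrative variable name).

```python
import re

# Shortened stand-in for the string in the question (an assumption for brevity).
text = 'biiig'

# Collect (start, end, matched text) for every match of i*.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r'i*', text)]

for start, end, value in matches:
    print(start, end, repr(value))
# 0 0 ''      empty match before b
# 1 4 'iii'   the run of i's
# 4 4 ''      empty match before g
# 5 5 ''      empty match at the end of the string
```

If only the runs of i are wanted, using i+ with re.findall avoids the empty entries altogether.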

Avinash Raj answered Oct 26 '22 11:10