Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

Tags:

python

regex

In a Python regular expression, I encounter this singular problem. Could you give instruction on the differences between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

Actual Output is:

['ab', 'cd']
['cd']

I'm confused as to why does the second result doesn't contain 'ab' as well?

like image 349
rock Avatar asked Jan 07 '20 08:01

rock


People also ask

What is the difference between re search and re Findall?

The re.It searches from start or end of the given string. If we use method findall to search for a pattern in a given string it will return all occurrences of the pattern. While searching a pattern, it is recommended to use re. findall() always, it works like re.search() and re.

What does the function re Findall regex any string do?

The findall() function scans the string from left to right and finds all the matches of the pattern in the string . The result of the findall() function depends on the pattern: If the pattern has no capturing groups, the findall() function returns a list of strings that match the whole pattern.

What is difference between Search () and Findall () methods in Python?

Here you can see that, search() method is able to find a pattern from any position of the string. The re. findall() helps to get a list of all matching patterns. It searches from start or end of the given string.

Is there any difference between re match () and re search () in the Python re module?

There is a difference between the use of both functions. Both return the first match of a substring found in the string, but re. match() searches only from the beginning of the string and return match object if found.


1 Answers

+ is a repeat quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. This will only capture the last iteration.

You can reason about this behaviour as follows:

Say your string is abcdla and regex is (ab|cd)+. Regex engine will find a match for the group between positions 0 and 1 as ab and exits the capture group. Then it sees + quantifier and so tries to capture the group again and will capture cd between positions 2 and 3.


If you want to capture all iterations, you should capture the repeating group instead with ((ab|cd)+) which matches abcd and cd. You can make the inner group non-capturing as we don't care about inner group matches with ((?:ab|cd)+) which matches abcd

https://www.regular-expressions.info/captureall.html

From the Docs,

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.

like image 167
Shashank V Avatar answered Oct 04 '22 20:10

Shashank V