I've run into a puzzling problem with Python regular expressions.
Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?
import re
string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)
The actual output is:
['ab', 'cd']
['cd']
I'm confused as to why the second result doesn't contain 'ab' as well.
If we use re.findall() to search for a pattern in a given string, it returns all non-overlapping occurrences of the pattern, not just the first one. In that respect it differs from re.search() and re.match(), which each return at most one match.
The findall() function scans the string from left to right and finds all matches of the pattern in the string. The result depends on the pattern: if the pattern has no capturing groups, findall() returns a list of strings that match the whole pattern; if it has exactly one capturing group, it returns a list of the strings captured by that group; if it has several groups, it returns a list of tuples with one entry per group.
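As a small sketch of how the return shape changes with the number of capturing groups, the snippet below runs three patterns over the question's string. The variable names and the third pattern, (ab)(cd), are only illustrative and do not come from the original post.

```python
import re

text = 'abcdla'

# No capturing group: each item is the whole match.
no_group = re.findall(r'(?:ab|cd)+', text)

# One capturing group: each item is what that group held at the end of the match.
one_group = re.findall(r'(ab|cd)+', text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(ab)(cd)', text)

print(no_group)    # ['abcd']
print(one_group)   # ['cd']
print(two_groups)  # [('ab', 'cd')]
```

The second case is exactly the situation in the question: the list holds group contents, not whole matches.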
re.search() can find the pattern at any position in the string, while re.findall() returns a list of all matches, scanning from the start of the string to the end.
re.match() and re.search() both return a match object for the first match (or None if there is none), but re.match() only matches at the beginning of the string.
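A minimal sketch of that difference, using the question's string and the sub-pattern cd (chosen here only because it does not occur at the start of the string):

```python
import re

text = 'abcdla'

# re.match() only tries to match at position 0, so 'cd' is not found.
at_start = re.match(r'cd', text)

# re.search() scans forward and returns the first match anywhere.
anywhere = re.search(r'cd', text)

print(at_start)         # None
print(anywhere.span())  # (2, 4)
```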
+ is a repeat quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. This will only capture the last iteration.
You can reason about this behaviour as follows:
Say your string is abcdla and the regex is (ab|cd)+. The regex engine finds a match for the group at positions 0 and 1 (ab) and exits the capture group. Then it sees the + quantifier, so it tries the group again and captures cd at positions 2 and 3, overwriting the earlier ab. The next character, l, matches neither alternative, so the overall match is abcd, but the group holds only cd, and that is what findall returns.
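One way to check this reasoning is to inspect the match object directly instead of going through findall. This is just a sketch; the variable names are illustrative.

```python
import re

m = re.search(r'(ab|cd)+', 'abcdla')

whole_match = m.group(0)   # everything the pattern matched
last_capture = m.group(1)  # what group 1 held when the match ended
capture_span = m.span(1)   # where that last capture sits in the string

print(whole_match)   # abcd
print(last_capture)  # cd
print(capture_span)  # (2, 4)
```

Note that Python reports spans as half-open intervals, so (2, 4) covers the characters at positions 2 and 3.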
If you want to capture all iterations, you should capture the repeating group instead with ((ab|cd)+), whose outer group captures abcd and whose inner group captures cd. Since we don't care about the inner group's matches, you can make it non-capturing with ((?:ab|cd)+), which captures abcd.
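Both variants can be tried on the question's string as below. The last step, splitting the captured run back into its individual pieces with a second findall, is an extra idea added here for illustration and is not part of the original answer.

```python
import re

text = 'abcdla'

# Outer group captures the whole run, inner group only the last iteration.
outer_and_inner = re.findall(r'((ab|cd)+)', text)

# Inner group made non-capturing: findall returns just the whole run.
outer_only = re.findall(r'((?:ab|cd)+)', text)

# Hypothetical follow-up: recover each iteration from the captured run.
pieces = re.findall(r'ab|cd', outer_only[0])

print(outer_and_inner)  # [('abcd', 'cd')]
print(outer_only)       # ['abcd']
print(pieces)           # ['ab', 'cd']
```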
https://www.regular-expressions.info/captureall.html
From the docs:
Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.
Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
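The quoted behaviour can be reproduced in Python with a short sketch like the one below. The second pattern, !((?:abc|123)+)!, is not taken from the quoted passage; it applies the same "group around the repetition" idea from the answer above and is shown here as an assumption about how one might capture the full label.

```python
import re

tags = ['!abc123!', '!123abcabc!']

# Repeated capturing group: only the last iteration is kept.
last_only = [re.fullmatch(r'!(abc|123)+!', t).group(1) for t in tags]

# Capturing group placed around the repetition: the whole label is kept.
whole_label = [re.fullmatch(r'!((?:abc|123)+)!', t).group(1) for t in tags]

print(last_only)    # ['123', 'abc']
print(whole_label)  # ['abc123', '123abcabc']
```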