re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

Tags:

python

regex

I ran into a puzzling behaviour with Python regular expressions. Could someone explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

The actual output is:

['ab', 'cd']
['cd']

I'm confused as to why the second result doesn't contain 'ab' as well.

asked Jan 07 '20 08:01

rock

1 Answer

+ is a quantifier that matches the preceding item one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) with +. A repeated capture group keeps only the text from its last iteration.

You can reason about this behaviour as follows:

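Say your string is abcdla and the regex is (ab|cd)+. The regex engine first matches the group against ab at indices 0 and 1 and leaves the capture group. It then sees the + quantifier, tries the group again, and matches cd at indices 2 and 3, which overwrites the ab stored in the group. At index 4 the l matches neither alternative, so the repetition stops. The overall match is abcd, but group 1 now holds only cd. Because the pattern contains exactly one capture group, re.findall returns that group's content rather than the whole match, which is why you get ['cd'].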
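To see the walk-through above in practice, here is a minimal sketch that uses re.search instead of re.findall, so the whole match and the group can be inspected separately. The variable name m is only for illustration.

```python
import re

string = 'abcdla'

# re.search returns a match object, which exposes both the whole match
# (group 0) and the capture group (group 1).
m = re.search(r'(ab|cd)+', string)

print(m.group(0))  # whole match: 'abcd'
print(m.span(0))   # (0, 4)
print(m.group(1))  # last iteration of the group: 'cd'
print(m.span(1))   # (2, 4)
```

The whole match covers abcd, while the group only remembers its final iteration.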

If you want to capture all iterations, capture the repeated group as a whole instead, with ((ab|cd)+). The outer group captures abcd and the inner group still captures only cd, so re.findall returns [('abcd', 'cd')]. Since the inner group's value is not needed, you can make it non-capturing with ((?:ab|cd)+), for which re.findall returns ['abcd'].

https://www.regular-expressions.info/captureall.html
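As a quick check of what re.findall returns for the patterns discussed in this answer, the following sketch runs each one on the string from the question. The variable names are illustrative only.

```python
import re

string = 'abcdla'

# Two capture groups: findall returns one tuple per match,
# (outer group, inner group). The inner group keeps only its last iteration.
nested = re.findall(r'((ab|cd)+)', string)
print(nested)      # [('abcd', 'cd')]

# Inner group made non-capturing: only the outer group is returned.
outer_only = re.findall(r'((?:ab|cd)+)', string)
print(outer_only)  # ['abcd']

# No capture groups at all: findall returns the whole match.
no_groups = re.findall(r'(?:ab|cd)+', string)
print(no_groups)   # ['abcd']
```

If only the full repeated run is needed, the last form avoids capture groups altogether and gives the same list.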

From that page:

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
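The tag example from the quoted passage can be reproduced in Python. This is a small sketch, not part of the quoted page. The second pattern, with the outer capturing group and a non-capturing inner group, applies the fix described earlier in this answer. The names repeated_group, whole_label, m1 and m2 are assumptions made for the example.

```python
import re

# The "quick and easy" pattern: the capture group itself is repeated.
repeated_group = re.compile(r'!(abc|123)+!')

m1 = repeated_group.search('!abc123!')
m2 = repeated_group.search('!123abcabc!')
print(m1.group(1))  # '123' (only the last iteration is kept)
print(m2.group(1))  # 'abc'

# Capturing the whole repetition: the quantifier sits inside the group.
whole_label = re.compile(r'!((?:abc|123)+)!')

label1 = whole_label.search('!abc123!').group(1)
label2 = whole_label.search('!123abcabc!').group(1)
print(label1)  # 'abc123'
print(label2)  # '123abcabc'
```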

answered Oct 04 '22 20:10

Shashank V
