Python re.findall() is not working as expected

Tags: python, regex

I have the code:

import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)

This returns ['aa']

If we have

import re
sequence="aabbaa"
rexp=re.compile("(aa|cc)+")
rexp.findall(sequence)

we get ['aa','aa']

Why is there a difference and why (for the first) do we not get ['aa','bb','aa']?

Thanks!

Dale Myers asked Oct 21 '12 15:10


2 Answers

The unwanted behaviour comes down to the way you formulated the regular expression:

rexp=re.compile("(aa|bb)+")

The parentheses in (aa|bb) form a capturing group.

If we look at the docs for findall, we see this:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
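As a rough illustration of the return shapes that passage describes, here is a small sketch. The patterns aa, a(a) and (a)(a) are chosen only for this example and do not come from the question:

```python
import re

text = "aabbaa"

# No groups: each item is the full text of a match.
no_groups = re.findall(r"aa", text)

# One group: each item is what that single group captured.
one_group = re.findall(r"a(a)", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(a)(a)", text)

print(no_groups)   # ['aa', 'aa']
print(one_group)   # ['a', 'a']
print(two_groups)  # [('a', 'a'), ('a', 'a')]
```

In all three cases the pattern matches the same two stretches of the string; only what findall reports for each match changes.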

Because you formed a group, it matched aa first, then bb, then aa again (because of the + quantifier), so the group holds aa at the end. findall returns that value in the list ['aa']: the whole expression matches only once (aabbaa), so the list contains a single element, the aa that was last saved in the group.
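One way to see this for yourself is to use finditer, which yields match objects instead of bare group values. This is only a sketch for inspection; the variable name results is made up for the example:

```python
import re

sequence = "aabbaa"
rexp = re.compile(r"(aa|bb)+")

# For every match, record the whole match, the group's final value, and the span.
results = [(m.group(0), m.group(1), m.span()) for m in rexp.finditer(sequence)]

print(results)  # [('aabbaa', 'aa', (0, 6))]
```

There is a single match covering the whole string, and group 1 holds only the last repetition, which is the value findall reports.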

From the code you gave, you seemed to want to do this:

>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']

(?: ...) doesn't create a capturing group, so findall returns the match of the whole expression.

At the end of your question you show the desired output. You get it by just looking for aa or bb; no quantifiers (+ or *) are needed. Do it along the lines of Inbar Rose's answer:

>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']
ovgolovin answered Oct 12 '22 16:10


Let me explain what you are doing:

regex = re.compile("(aa|bb)+")

You are creating a regex that looks for aa or bb, then checks whether more aa or bb follow, and keeps going until it finds neither. Because the capturing group can hold only one aa or bb at a time, you get just the last one it captured.

However, if you have a string like aaxaabbxaa, you will get aa, bb, aa. The regex first finds aa, looks for more, and finds only an x, so the first match ends with aa in the group. Then it finds another aa followed by bb and then an x, so it stops, and the second match leaves bb in the group. Then it finds a final aa. So your result is aa, bb, aa.
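The walk-through for aaxaabbxaa can be checked with a short sketch like the following; the names pairs and found are just illustrative:

```python
import re

rexp = re.compile(r"(aa|bb)+")
text = "aaxaabbxaa"

# Pair each whole match with the last value its group captured.
pairs = [(m.group(0), m.group(1)) for m in rexp.finditer(text)]
found = rexp.findall(text)

print(pairs)  # [('aa', 'aa'), ('aabb', 'bb'), ('aa', 'aa')]
print(found)  # ['aa', 'bb', 'aa']
```

The middle match is aabb as a whole, but findall reports only bb, the last repetition of the group.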

I hope this explains what you are doing, and that the result is as expected. To get every aa or bb separately, remove the +, which tells the regex to repeat the group as many times as it can within a single match, and have the regex return each aa or bb as its own match.

So your regex should be:

regex = re.compile("(aa|bb)")

cheers.

Inbar Rose answered Oct 12 '22 16:10