
Why does re.findall return a list of tuples when my pattern only contains one group?

Say I have a string s containing letters and two delimiters 1 and 2. I want to split the string in the following way:

  • if a substring t falls between 1 and 2, return t
  • otherwise, return each character

So if s = 'ab1cd2efg1hij2k', the expected output is ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k'].

I tried to use regular expressions:

import re
s = 'ab1cd2efg1hij2k'
re.findall( r'(1([a-z]+)2|[a-z])', s )

[('a', ''),
 ('b', ''),
 ('1cd2', 'cd'),
 ('e', ''),
 ('f', ''),
 ('g', ''),
 ('1hij2', 'hij'),
 ('k', '')]

From there I can do [ x[x[-1]!=''] for x in re.findall( r'(1([a-z]+)2|[a-z])', s ) ] to get my answer, but I still don't understand the output. The documentation says that findall returns a list of tuples if the pattern has more than one group. However, my pattern only contains one group. Any explanation is welcome.

usual me, asked Jul 06 '14 07:07


2 Answers

If you want an 'or' (alternation) match without the result being split into match groups, make the group that holds the alternation non-capturing by adding '?:' right after its opening parenthesis.

Without '?:'

re.findall('(test (word1|word2))', 'test word1')

Output:
[('test word1', 'word1')]

With '?:'

re.findall('(test (?:word1|word2))', 'test word1')

Output:
['test word1']

Further explanation: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
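Applied to the pattern from the question, the same idea can be sketched as follows. This is only an illustration, not part of the original answer: the variable names are made up, and the re.finditer variant is a swapped-in alternative to findall. It first counts the capturing groups in the asker's pattern (the outer parentheses are a group too), then shows what a single remaining group gives, and finally one way to get the desired list.

```python
import re

s = 'ab1cd2efg1hij2k'

# The pattern from the question: the outer parentheses are group 1,
# the inner ones are group 2, so findall returns 2-tuples.
original = re.compile(r'(1([a-z]+)2|[a-z])')
print(original.groups)  # number of capturing groups in the pattern

# With the outer group made non-capturing, one capturing group is left.
# findall then returns plain strings, but only the content of that group,
# which is empty for matches where the group did not take part.
single_group = re.findall(r'(?:1([a-z]+)2|[a-z])', s)
print(single_group)

# Assumption: we want the group when it matched, else the whole match.
# finditer yields match objects, so both are available per match.
result = [m.group(1) or m.group(0)
          for m in re.finditer(r'1([a-z]+)2|[a-z]', s)]
print(result)
```

Note that '?:' alone does not produce the list the asker wants here, because the remaining group is empty for the single-letter matches; falling back to the whole match covers that case.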

Sebastian N, answered Oct 02 '22 11:10


I am 5 years too late to the party, but I think I might have found an elegant solution to the ugly, tuple-ridden output of re.findall() with multiple capture groups.

In general, if you end up with output that looks something like this:

[('pattern_1', '', ''), ('', 'pattern_2', ''), ('pattern_1', '', ''), ('', '', 'pattern_3')]

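Then you can flatten it into a list with this little trick. Since each tuple has exactly one non-empty entry, joining its elements yields just that entry (here all_patterns is the regex and iterable is the string being searched):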

["".join(x) for x in re.findall(all_patterns, iterable)]

The expected output will be like so:

['pattern_1', 'pattern_2', 'pattern_1', 'pattern_3']

It was tested on Python 3.7. Hope it helps!
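For the string from the question, the join trick might look like the sketch below. It assumes the pattern is rewritten so that the two groups are separate alternatives (that rewrite is my assumption, not something from the question), because the trick relies on only one group being non-empty per match. The last lines show what happens with the original nested groups.

```python
import re

s = 'ab1cd2efg1hij2k'

# Two sibling groups: for any match exactly one of them is non-empty.
pattern = r'1([a-z]+)2|([a-z])'
tuples = re.findall(pattern, s)
print(tuples)

# Joining each tuple drops the empty entry and keeps the matched text.
flat = ["".join(t) for t in tuples]
print(flat)

# With the nested groups from the question, both entries can be non-empty,
# so joining concatenates the outer and the inner match.
nested_flat = ["".join(t) for t in re.findall(r'(1([a-z]+)2|[a-z])', s)]
print(nested_flat)
```

So the trick fits patterns whose groups are mutually exclusive alternatives; with nested groups the joined strings contain duplicated text.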

Greem666, answered Oct 02 '22 11:10