What causes the '' in ['h', 'e', 'l', 'l', 'o', ''] when you do re.findall('[\w]?', 'hello')? I thought the result would be ['h', 'e', 'l', 'l', 'o'], without the last empty string.
findall(): finds all matches in a string. The re.findall() function returns a list of strings containing every non-overlapping match. If the pattern is not found, re.findall() returns an empty list.
The re module's re.findall() function scans the entire target string for the regex pattern and returns all the matches that were found as a list.
The search() method can find a pattern at any position in the string, but it returns only the first match. re.findall() returns a list of all matches, scanning the string from left to right.
The question mark in your regex ('[\w]?') is responsible for the empty string being one of the returned results.
A question mark is a quantifier meaning "zero or one match." You are asking for all occurrences of zero or one "word characters." Each letter satisfies the "one word character" case. The empty string at the end satisfies the "zero word characters" case.
Change your regex to '\w' (remove the question mark and the superfluous character class brackets) and the output will be as you expect.
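As a quick check of the fix described above, here is a minimal sketch comparing the two patterns. The variable names are only illustrative, and raw strings (r"...") are used so that Python does not treat \w as a string escape.

```python
import re

text = "hello"

# Original pattern: "?" allows a match of zero characters,
# so one empty match is reported at the end of the string.
with_optional = re.findall(r"[\w]?", text)

# Without the quantifier, every match must consume one word character.
without_optional = re.findall(r"\w", text)

print(with_optional)     # ['h', 'e', 'l', 'l', 'o', '']
print(without_optional)  # ['h', 'e', 'l', 'l', 'o']
```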
Regexes search through strings one character at a time. If a match is found at a character position, the regex advances to the next part of the pattern. If a match is not found, the regex tries the alternatives (different variations) if any are available. If all alternatives fail, it backtracks and tries the alternatives of the previous part, and so on, until either an entire match is found or all alternatives fail. This is why some seemingly simple regexes match a string quickly but can take exponential time to fail. In your example you only have one part to your pattern.
You are searching for [\w]?. The ? means "one or zero of the prior part" and is equivalent to {0,1}. Each of 'h', 'e', 'l', 'l' and 'o' matches [\w]{1}, so the pattern advances and completes for each letter, restarting the regex at the beginning because you asked for all the matches, not just the first. At the end of the string the regex is still trying to find a match. [\w]{1} no longer matches but the alternative [\w]{0} does, so it matches ''. Modern regex engines have a rule to stop zero-length matches from repeating at the same position. The regex tries again, but this time fails because it can't find a match for [\w]{1} and it has already found a match for [\w]{0}. It can't advance through the string because it is at the end, so it exits. It has run the pattern 7 times and found 6 matches, the last of which was empty.
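To see where each of those six matches lands, one option is to use re.finditer(), which exposes the start and end position of every match. This is only a sketch for inspection; the spans list is an illustrative name, not part of the question.

```python
import re

# Collect (start, end, matched_text) for every match of the original pattern.
spans = [(m.start(), m.end(), m.group())
         for m in re.finditer(r"[\w]?", "hello")]

for start, end, matched in spans:
    print(start, end, repr(matched))

# The last entry is the zero-length match at the end of the string:
# its start and end are both 5 and its text is ''.
```

The five letters each occupy a one-character span, and the final match has equal start and end positions, which is what a zero-length match looks like.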
As pointed out in a comment, if your regex was \w?? (I've removed [ and ] because they aren't necessary in your original regex), it means find zero or one (note the order has changed from before). It will return '', 'h', '', 'e', '', 'l', '', 'l', '', 'o' and ''. This is because it now prefers to find zero but it can't find two zero-length matches in a row without advancing.
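The lazy variant can be tried the same way. This sketch assumes Python 3.7 or later, where an empty match is allowed directly after a non-empty one; older versions handled empty matches differently and may give another result.

```python
import re

# "??" is the lazy form of "?": it prefers zero characters and only
# consumes one when an empty match is not allowed at that position.
lazy = re.findall(r"\w??", "hello")

print(lazy)
# On Python 3.7+: ['', 'h', '', 'e', '', 'l', '', 'l', '', 'o', '']
```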