
What causes the '' in ['h', 'e', 'l', 'l', 'o', ''] when you do re.findall('[\w]?', 'hello')?

What causes the '' in ['h', 'e', 'l', 'l', 'o', ''] when you call re.findall('[\w]?', 'hello')? I expected the result to be ['h', 'e', 'l', 'l', 'o'], without the trailing empty string.

Wagdet asked Jan 18 '16 17:01




2 Answers

The question mark in your regex ('[\w]?') is responsible for the empty string being one of the returned results.

A question mark is a quantifier meaning "zero or one matches." You are asking for all occurrences of zero or one "word characters". Each letter satisfies the "one word character" case. The empty string at the end of the input satisfies the "zero word characters" case.

Change your regex to '\w' (remove the question mark and superfluous character class brackets) and the output will be as you expect.
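To see the difference side by side, here is a minimal sketch using the string from the question. The variable names are only illustrative, and raw strings (r"...") are used so that Python does not treat \w as a string escape.

```python
import re

text = "hello"

# Original pattern: "zero or one word character" can also match the
# empty string left over at the end of the input.
with_optional = re.findall(r"[\w]?", text)

# Suggested pattern: exactly one word character, so an empty match
# is impossible.
without_optional = re.findall(r"\w", text)

print(with_optional)     # ['h', 'e', 'l', 'l', 'o', '']
print(without_optional)  # ['h', 'e', 'l', 'l', 'o']
```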

Michael Burjack answered Oct 13 '22 22:10


Backtracking regex engines search through strings one position at a time. If part of the pattern matches at the current position, the engine advances to the next part of the pattern. If it does not match, the engine tries the alternatives (different variations), if any are available. If all alternatives fail, it backtracks, tries the alternatives for the previous part, and so on until either an entire match is found or every alternative has failed. This is why some seemingly simple regexes match one string quickly but take exponential time to fail on another. Your example has only one part to its pattern.

You are searching for [\w]?. The ? means "one or zero of the preceding part" and is equivalent to {0,1}. Each of 'h', 'e', 'l', 'l' and 'o' matches [\w]{1}, so the pattern completes for each letter, and the engine restarts the pattern from its beginning at the next position because you asked for all the matches, not just the first. At the end of the string the engine is still trying to find a match. [\w]{1} no longer matches there, but the alternative [\w]{0} does, so it matches ''. Modern regex engines have a rule that stops zero-length matches from repeating at the same position. The engine tries again, but this time it fails because it can't find a match for [\w]{1} and it has already found a match for [\w]{0} at this position. It can't advance through the string because it is at the end, so it exits. It has run the pattern 7 times and found 6 matches, the last of which was empty.
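The walk through the string can be made visible with re.finditer, which reports where each match starts and ends. This is only a sketch of what Python's re module reports for this input, not a trace of the engine's internals; the name spans is illustrative.

```python
import re

# Collect each matched text together with its (start, end) position.
spans = [(m.group(), m.span()) for m in re.finditer(r"[\w]?", "hello")]

for matched, (start, end) in spans:
    # The last line printed is the zero-length match at the end: '' 5 5
    print(repr(matched), start, end)
```

The five letters occupy positions 0 to 5, and the final match is the empty string at position 5, where start and end are equal.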

As pointed out in a comment, if your regex were \w?? (I've removed [ and ] because they aren't necessary in your original regex), it would mean "find zero or one, preferring zero" (note that the order of preference is reversed). It will return '', 'h', '', 'e', '', 'l', '', 'l', '', 'o' and ''. This is because the engine now prefers the zero-length match, but it can't find two zero-length matches in a row without advancing.
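A quick way to check the lazy variant is sketched below. It assumes Python 3.7 or later, where a non-empty match may start immediately after an empty one; earlier versions handled empty matches differently and may return a different list.

```python
import re

# Lazy optional: try the zero-length match first, and only consume a
# word character when an empty match is not allowed at that position.
lazy = re.findall(r"\w??", "hello")

print(lazy)
# On Python 3.7+:
# ['', 'h', '', 'e', '', 'l', '', 'l', '', 'o', '']
```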

CJ Dennis answered Oct 14 '22 00:10