With the re
module, it seems that I am unable to split on pattern matches that are empty strings:
>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']
In other words, even if a match is found, if it's the empty string, even re.split
cannot split the string.
The docs for re.split
seem to support my results.
A "workaround" was easy enough to find for this particular case:
>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']
But this is an error-prone way of doing it because then I have to beware of strings that already contain the substring that I'm splitting on:
>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']
Is there any better way to split on an empty pattern match with the re
module? Additionally, why does re.split
not allow me to do this in the first place? I know it's possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript's built-in String.prototype.split()
.
split() and rsplit() split only when sep matches completely. If you want to split a string that matches a regular expression (regex) instead of perfect match, use the split() of the re module. In re. split() , specify the regex pattern in the first parameter and the target character string in the second parameter.
The split() method does not change the value of the original string. If the delimiter is an empty string, the split() method will return an array of elements, one element for each character of string. If you specify an empty string for string, the split() method will return an empty string and not an array of strings.
Introduction to the Python regex split() function The built-in re module provides you with the split() function that splits a string by the matches of a regular expression. In this syntax: pattern is a regular expression whose matches will be used as separators for splitting.
sub() method will replace all pattern occurrences in the target string.
import regex
x="bazbarbarfoobar"
print regex.split(r"(?<!baz)(?=bar)",x,flags=regex.VERSION1)
You can use regex
module here for this.
or
(.+?(?<!foo))(?=bar|$)|(.+?foo)$
Use re.findall
.
See demo
It is unfortunate that the split
requires a non-zero-width match, but it hasn't been to fixed yet, since quite a lot incorrect code depends on the current behaviour by using for example [something]*
as the regex. Use of such patterns will now generate a FutureWarning
and those that never can split anything, throw a ValueError
from Python 3.5 onwards:
>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/re.py", line 212, in split
return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.
The idea is that after a certain period of warnings, the behaviour can be changed so that your regular expression would work again.
If you can't use the regex
module, you can write your own split function using re.finditer()
:
def megasplit(pattern, string):
splits = list((m.start(), m.end()) for m in re.finditer(pattern, string))
starts = [0] + [i[1] for i in splits]
ends = [i[0] for i in splits] + [len(string)]
return [string[start:end] for start, end in zip(starts, ends)]
print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit(r'o', 'foobarbarbazbar'))
If you are sure that the matches are zero-width only, you can use the starts of the splits for easier code:
import re
def zerowidthsplit(pattern, string):
splits = list(m.start() for m in re.finditer(pattern, string))
starts = [0] + splits
ends = splits + [ len(string) ]
return [string[start:end] for start, end in zip(starts, ends)]
print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With