Python regex: splitting on pattern match that is an empty string

Tags: python, string, regex, split

With the re module, it seems that I am unable to split on pattern matches that are empty strings:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']

In other words, even when the pattern matches, re.split will not split the string if the match is the empty string.

The docs for re.split seem to support my results.

A "workaround" was easy enough to find for this particular case:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']

But this is an error-prone way of doing it because then I have to beware of strings that already contain the substring that I'm splitting on:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']

Is there any better way to split on an empty pattern match with the re module? Additionally, why does re.split not allow me to do this in the first place? I know it's possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript's built-in String.prototype.split().


asked May 01 '15 14:05

Shashank

2 Answers

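import regex  # third-party module: pip install regex

x = "bazbarbarfoobar"
# VERSION1 selects the regex module's newer behaviour, which splits on zero-width matches
print(regex.split(r"(?<!baz)(?=bar)", x, flags=regex.VERSION1))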

You can use the third-party regex module for this, as shown above. Alternatively, use re.findall with the following pattern:

(.+?(?<!foo))(?=bar|$)|(.+?foo)$
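To illustrate the re.findall approach, here is a minimal sketch that applies the pattern to the string from the question. Because the pattern has two capturing groups, re.findall returns one tuple per match, so the sketch flattens them. The names pattern, matches and pieces are only illustrative.

```python
import re

pattern = r'(.+?(?<!foo))(?=bar|$)|(.+?foo)$'

# With two capturing groups, findall returns (group1, group2) tuples;
# the group that did not take part in a match is an empty string.
matches = re.findall(pattern, 'foobarbarbazbar')

# Keep whichever group actually matched.
pieces = [first or second for first, second in matches]
print(pieces)  # ['foobar', 'barbaz', 'bar']
```

Note that this pattern is specific to the foo/bar example; a different split condition needs a different pattern.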



answered Oct 12 '22 10:10

vks

It is unfortunate that split requires a non-zero-width match. This has not been fixed yet, because quite a lot of incorrect code depends on the current behaviour by using, for example, [something]* as the regex. In Python 3.5 and 3.6, patterns that can match an empty string generate a FutureWarning, and patterns that can only match an empty string, and so could never split anything, raise a ValueError:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.

The idea is that after a certain period of warnings, the behaviour can be changed so that your regular expression would work again.
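For reference, the re documentation records that this change arrived in Python 3.7, which added support for splitting on a pattern that can match an empty string. A minimal sketch, assuming Python 3.7 or later, applies the pattern from the question directly (the version guard and the names pattern, text and parts are only for illustration):

```python
import re
import sys

pattern = r'(?<!foo)(?=bar)'
text = 'foobarbarbazbar'

if sys.version_info >= (3, 7):
    # On 3.7+, re.split splits at zero-width matches.
    parts = re.split(pattern, text)
else:
    # Older versions warn, raise, or return the string unsplit.
    parts = None

print(parts)  # on Python 3.7+: ['foobar', 'barbaz', 'bar']
```

On older interpreters, the regex module or a finditer-based helper is still needed.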

If you can't use the regex module, you can write your own split function using re.finditer():

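import re

def megasplit(pattern, string):
    # (start, end) of every match, including zero-width ones
    splits = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    # each piece starts where the previous match ended
    starts = [0] + [end for _, end in splits]
    # each piece ends where the next match starts
    ends = [start for start, _ in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]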

print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit(r'o', 'foobarbarbazbar'))

If you are sure that all matches are zero-width, you only need the start of each match, which makes the code simpler:

import re

def zerowidthsplit(pattern, string):
    splits = [m.start() for m in re.finditer(pattern, string)]
    starts = [0] + splits
    ends = splits + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))

answered Oct 12 '22 10:10

Antti Haapala -- Слава Україні
