I want to split a string using -, +=, ==, =, +, and white-space as delimiters. I want to keep the delimiter unless it is white-space.
I've tried to achieve this with the following code:
import re

def tokenize(s):
    pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
    return pattern.split(s)

print(tokenize("hello-+==== =+ there"))
I expected the output to be
['hello', '-', '+=', '==', '=', '=', '+', 'there']
However I got
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
Which is almost what I wanted, except that there are quite a few extraneous Nones and empty strings.
Why is it behaving this way, and how might I change it to get what I want?
By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use):
['hello', '', '', '', '', '', '', '', 'there']
Note the empty strings in between - and +=, += and ==, etc.
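The list above can be reproduced by swapping in a non-capturing version of the pattern. This is only a minimal sketch; the name pieces is illustrative:

```python
import re

# Same alternatives as in the question, but wrapped in a non-capturing
# group (?:...), so split() returns only the text between matches.
pattern = re.compile(r"(?:-|\+=|==|=|\+)|\s+")

pieces = pattern.split("hello-+==== =+ there")
print(pieces)
# ['hello', '', '', '', '', '', '', '', 'there']
```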
As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the bits that the capture group matches are interspersed:
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
None corresponds to where the \s+ half of your pattern was matched; in those cases, the capture group captured nothing.
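A smaller input may make the origin of the None entries easier to see. The sketch below uses a shortened, hypothetical pattern that keeps only the - alternative inside the capture group:

```python
import re

# Group 1 captures '-'; the \s+ branch sits outside the group, so when
# white-space is the delimiter the group contributes None to the result.
parts = re.split(r"(-)|\s+", "a-b c")
print(parts)
# ['a', '-', 'b', None, 'c']
```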
From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard Nones and empty strings:
import re

def tokenize(s):
    pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
    # Drop the None and '' entries that split() leaves between matches
    return [x for x in pattern.split(s) if x]
One last note: For what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options.
Why is it behaving this way?
According to the documentation for re.split:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
This is literally correct: if capturing parentheses are used, then the text of all groups is returned, whether or not they matched anything; the ones that didn't match anything return None.
As always with split, two consecutive delimiters are considered to separate empty strings, so you get empty strings interspersed.
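The same behaviour shows up with plain str.split, which may be the more familiar case. A minimal illustration, using a made-up comma-separated string:

```python
import re

text = "a,,b"

# Two adjacent commas delimit an empty string between them,
# both for str.split and for re.split.
plain = text.split(",")
regex = re.split(",", text)
print(plain, regex)
# ['a', '', 'b'] ['a', '', 'b']
```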
how might I change it to get what I want?
The simplest solution is to filter the output:
list(filter(None, pattern.split(s)))
Perhaps re.findall would be more suitable for you?
>>> re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', "hello-+==== =+ there")
['hello', '-', '+=', '==', '=', '=', '+', 'there']
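Wrapped in a function, that approach might look like the sketch below. The name tokenize is borrowed from the question and TOKEN is just an illustrative name. The pattern lists += and == before = and +, since alternation takes the first branch that matches.

```python
import re

# Operators first (longer ones before their prefixes), then any run of
# characters that is neither an operator character nor white-space.
TOKEN = re.compile(r"-|\+=|==|=|\+|[^-+=\s]+")

def tokenize(s):
    # findall returns every non-overlapping match; white-space matches
    # no branch, so it is skipped rather than returned.
    return TOKEN.findall(s)

print(tokenize("hello-+==== =+ there"))
# ['hello', '-', '+=', '==', '=', '=', '+', 'there']
```

Because nothing is split here, there are no empty strings or None entries to filter out afterwards.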