I was looking at the responses to this earlier-asked question:
Split Strings with Multiple Delimiters?
For my variant of this problem, I wanted to split on everything that wasn't in a specific set of characters. That led me to a solution I liked, until I found this apparent bug. Is this a bug, or some quirk of Python I'm unfamiliar with?
>>> import re
>>> b = "Which_of'these-markers/does,it:choose to;split!on?"
>>> b1 = re.split("[^a-zA-Z0-9_'-/]+", b)
>>> b1
["Which_of'these-markers/does,it", 'choose', 'to', 'split', 'on', '']
I don't understand why it doesn't split on the comma (','), given that a comma is not in my exception list.
The '-/ inside the character class creates a range, from ' (0x27) to / (0x2F), and that range includes the comma (0x2C).
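To see exactly which characters the accidental range covers, here is a small sketch that walks the code points from ' to /. The variable names are only illustrative.

```python
# Enumerate every character covered by the range '-/ in a character class.
start, end = ord("'"), ord("/")          # 0x27 and 0x2F
covered = "".join(chr(c) for c in range(start, end + 1))

print(covered)          # '()*+,-./
print("," in covered)   # True: the comma is inside the range
```

So the comma (along with the parentheses, *, +, and .) is treated as an allowed character and is never used as a split point.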
When you need to put a literal hyphen in a Python re pattern, put it:
At the start: [-A-Z] (matches an uppercase ASCII letter or -)
At the end: [A-Z()-] (matches an uppercase ASCII letter, (, ) or -)
After a valid range: [A-Z-+] (matches an uppercase ASCII letter, - or +)
You cannot put it after a shorthand class, right before a standalone symbol: [\w-+] causes a "bad character range" error. This is valid in .NET and some other regex flavors, but not in Python re.
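The placement rules can be checked with a short sketch like the one below. The `results` dictionary and the `shorthand_compiles` flag are hypothetical names used only for this illustration, and the escaped form [A-Z\-+] is added for comparison.

```python
import re

# Character classes in which the hyphen is a literal:
# at the start, at the end, after a valid range, and escaped.
patterns = [r"[-A-Z]", r"[A-Z()-]", r"[A-Z-+]", r"[A-Z\-+]"]

# Record whether each class matches a lone hyphen.
results = {p: re.fullmatch(p, "-") is not None for p in patterns}
print(results)

# A hyphen placed between a shorthand class and a symbol is rejected.
try:
    re.compile(r"[\w-+]")
    shorthand_compiles = True
except re.error as exc:
    shorthand_compiles = False
    print("rejected:", exc)
```

Every entry in `results` should be True, while compiling [\w-+] raises re.error.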
So put the hyphen at the end of the character class, or escape it (\-).
Use
re.split(r"[^a-zA-Z0-9_'/-]+", b)
In Python 2.7, you may even contract it to
re.split(r"[^\w'/-]+", b)
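As a quick check, the sketch below runs both patterns against the string from the question. On Python 3, \w also matches non-ASCII word characters in str patterns, so the re.ASCII flag is passed here to keep the shorthand equivalent to a-zA-Z0-9_. Dropping the trailing empty string is an optional extra, not part of the original answer.

```python
import re

b = "Which_of'these-markers/does,it:choose to;split!on?"

# Hyphen moved to the end of the class, so it is a literal, not a range.
fixed = re.split(r"[^a-zA-Z0-9_'/-]+", b)

# Shorthand form; re.ASCII limits \w to [a-zA-Z0-9_] on Python 3.
short = re.split(r"[^\w'/-]+", b, flags=re.ASCII)

# The trailing '?' leaves an empty string at the end; filter it out if unwanted.
tokens = [t for t in fixed if t]

print(fixed)
# ["Which_of'these-markers/does", 'it', 'choose', 'to', 'split', 'on', '']
print(tokens)
```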