
Python split on multiple delimiters bug?

I was looking at the responses to this earlier-asked question:

Split Strings with Multiple Delimiters?

For my variant of this problem, I wanted to split on everything that wasn't in a specific set of characters. That led me to a solution I liked, until I found this apparent bug. Is this a bug, or some quirk of Python I'm unfamiliar with?

>>> import re
>>> b = "Which_of'these-markers/does,it:choose to;split!on?"
>>> b1 = re.split("[^a-zA-Z0-9_'-/]+", b)
>>> b1
["Which_of'these-markers/does,it", 'choose', 'to', 'split', 'on', '']

I don't understand why it doesn't split on the comma (','), given that a comma is not in my exception list.

dipankar asked Jan 04 '23 04:01


1 Answer

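The '-/ inside the character class creates a range from ' (code point 39) to / (code point 47), and that range includes the comma (code point 44). Because the class is negated, everything in that range is excluded from the delimiters.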
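To check which characters the '-/ range actually covers, here is a quick sketch that walks the code points from ' to / (the name `covered` is only illustrative):

```python
# List every character from ' (code point 39) through / (code point 47),
# which is what the range '-/ means inside a character class.
covered = [chr(c) for c in range(ord("'"), ord("/") + 1)]
print(covered)
# ["'", '(', ')', '*', '+', ',', '-', '.', '/']
```

The comma sits in the middle of that list, which is why it was never treated as a delimiter.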

When you need to put a literal hyphen in a Python re pattern, put it:

  • at the start: [-A-Z] (matches an uppercase ASCII letter and -)
  • at the end: [A-Z()-] (matches an uppercase ASCII letter, (, ) or -)
  • after a valid range: [A-Z-+] (matches an uppercase ASCII letter, - or +)
  • or just escape it.
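As a rough check of the placements listed above, this sketch tests that each class matches a literal hyphen and does not pick up a comma by accident. The `patterns` list and variable names are mine, chosen for illustration:

```python
import re

# Start, end, after a valid range, and escaped.
patterns = [r"[-A-Z]", r"[A-Z()-]", r"[A-Z-+]", r"[A-Z\-+]"]

# Every class should accept a lone hyphen...
hyphen_is_literal = [bool(re.fullmatch(p, "-")) for p in patterns]
print(hyphen_is_literal)
# [True, True, True, True]

# ...and none of them should accept a comma.
comma_matches = [bool(re.fullmatch(p, ",")) for p in patterns]
print(comma_matches)
# [False, False, False, False]
```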

You cannot put it after a shorthand class and right before a standalone symbol, as in [\w-+]: that raises a "bad character range" error. This is valid in .NET and some other regex flavors, but not in Python re.
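A minimal way to see that error for yourself, assuming Python 3 (the exact wording of the message may differ slightly between versions):

```python
import re

message = None
try:
    # A hyphen between a shorthand class and a literal is not a valid range.
    re.compile(r"[\w-+]")
except re.error as exc:
    message = str(exc)

# The message should begin with "bad character range".
print(message)
```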

So in your pattern, move the hyphen to the end of the character class, or escape it:

re.split(r"[^a-zA-Z0-9_'/-]+", b)
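For reference, here is a sketch of what the corrected pattern should return for the string from the question. The trailing empty string comes from the final ?, and the list comprehension that drops it is my own optional addition:

```python
import re

b = "Which_of'these-markers/does,it:choose to;split!on?"

# Hyphen moved to the end, so ' / and - are three separate literals.
parts = re.split(r"[^a-zA-Z0-9_'/-]+", b)
print(parts)
# ["Which_of'these-markers/does", 'it', 'choose', 'to', 'split', 'on', '']

# Optionally discard empty strings left by leading or trailing delimiters.
tokens = [p for p in parts if p]
print(tokens)
# ["Which_of'these-markers/does", 'it', 'choose', 'to', 'split', 'on']
```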

In Python 2.7, where \w matches only [a-zA-Z0-9_] by default, you may even contract it to

re.split(r"[^\w'/-]+", b)
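In Python 3, \w is Unicode-aware for str patterns, so the contracted form is only equivalent to the explicit one if you pass re.ASCII. A small sketch of the difference (the "café;bar" sample string is made up for illustration):

```python
import re

b = "Which_of'these-markers/does,it:choose to;split!on?"

explicit = re.split(r"[^a-zA-Z0-9_'/-]+", b)
shorthand = re.split(r"[^\w'/-]+", b, flags=re.ASCII)
print(explicit == shorthand)
# True

# Without re.ASCII, \w also accepts non-ASCII word characters such as é.
sample = "caf\u00e9;bar"
unicode_split = re.split(r"[^\w'/-]+", sample)
ascii_split = re.split(r"[^\w'/-]+", sample, flags=re.ASCII)
print(unicode_split)  # ['café', 'bar']
print(ascii_split)    # ['caf', 'bar']
```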
Wiktor Stribiżew answered Jan 13 '23 15:01