
How to split by commas that are not within parentheses?

Tags: python, regex

Say I have a string like this, where items are separated by commas but there may also be commas within items that have parenthesized content:

(EDIT: Sorry, forgot to mention that some items may not have parenthesized content)

"Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

How can I split the string on only those commas that are NOT within parentheses? That is, to get:

["Water", "Titanium Dioxide (CI 77897)", "Black 2 (CI 77266)", "Iron Oxides (CI 77491, 77492, 77499)", "Ultramarines (CI 77007)"]

I think I'd have to use a regex, perhaps something like this:

([(]?)(.*?)([)]?)(,|$)

but I'm still trying to make it work.

bard asked Oct 29 '14 14:10


3 Answers

Use a negative lookahead to match only the commas that are not inside parentheses. Splitting the input string on those commas gives you the desired output.

,\s*(?![^()]*\))


>>> import re
>>> s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.split(r',\s*(?![^()]*\))', s)
['Water', 'Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)', 'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
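
To make the pattern easier to read, here is the same regex written in verbose mode with one comment per part. This is only a sketch: the names SPLIT_RX and split_outside_parens are made up for illustration, and it assumes parentheses are balanced and never nested.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
SPLIT_RX = re.compile(
    r"""
    ,            # a comma
    \s*          # plus any whitespace after it, so the items come out trimmed
    (?!          # that is NOT followed by:
        [^()]*   #   a run of characters with no parenthesis in it
        \)       #   and then a closing parenthesis
    )
    """,
    re.VERBOSE,
)


def split_outside_parens(text):
    """Split text on commas that are not inside a (non-nested) pair of parentheses."""
    return SPLIT_RX.split(text)


print(split_outside_parens("Water, Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"))
# ['Water', 'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
```

Items without parenthesized content, such as "Water", are handled too, because a comma followed by no closing parenthesis at all also passes the lookahead.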
Avinash Raj answered Nov 14 '22 17:11


You can do it with just str.replace and str.split: replace every ), with ) followed by a marker, then split on the marker. Any marker works as long as it does not occur in the string itself.

a = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
# Turn each '),' into ')//' and split on '//' (the marker must not occur in the input).
a = a.replace('),', ')//').split('//')
print(a)

Output:

['Titanium Dioxide (CI 77897)', ' Black 2 (CI 77266)', ' Iron Oxides (CI 77491, 77492, 77499)', ' Ultramarines (CI 77007)']
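
One caveat worth checking before relying on this: the split only happens after a closing parenthesis, so an item without parenthesized content (like "Water" in the question) stays glued to the item after it. The sketch below wraps the same replace/split idea in a hypothetical helper, split_after_paren, that also strips the leading spaces, and shows that behaviour.

```python
def split_after_paren(text, marker="//"):
    """Split after each '),' as in the answer above, then trim whitespace.

    `marker` is an arbitrary placeholder and must not occur in `text`.
    """
    return [part.strip() for part in text.replace("),", ")" + marker).split(marker)]


s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266)"
print(split_after_paren(s))
# ['Water, Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)']
```

So this is fine when every item ends with a parenthesized part, but not for the mixed input in the question.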
Vishnu Upadhyay answered Nov 14 '22 15:11


I believe I have a simpler regexp for this:

import re
# string_to_split holds the input string
rx_comma = re.compile(r",(?![^(]*\))")
result = rx_comma.split(string_to_split)

Explanation of the regexp:

  • Match a , that:
    • is NOT followed by:
      • a run of characters ending with ), where
      • that run (everything between the , and the )) does not contain (
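
To see the rule in action, this small sketch lists every comma in a string and says whether the pattern treats it as a split point. The helper name comma_report is made up for illustration.

```python
import re

rx_comma = re.compile(r",(?![^(]*\))")


def comma_report(text):
    """Return (index, is_split_point) for every comma in text."""
    split_points = {m.start() for m in rx_comma.finditer(text)}
    return [(i, i in split_points) for i, ch in enumerate(text) if ch == ","]


print(comma_report("a, b (c, d), e"))
# [(1, True), (7, False), (11, True)]
```

The comma at index 7 is rejected because the text after it reaches a ) before any (, which is exactly the "inside parentheses" case.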

It does not handle nested parentheses, as in a,b(c,d(e,f)). If you need that, one possible solution is to go through the result of the split and, whenever a piece has an opening parenthesis without its closing one, merge it with the piece(s) that follow :), like:

"a"
"b(c" <- no closing, merge this with the next piece
"d(e,f))"
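
The merge step described above could be sketched roughly like this. The function name merge_unbalanced is hypothetical, and the sketch assumes the pieces are rejoined with a plain "," (the character that was split on) and that a piece counts as finished once its parentheses balance.

```python
import re

rx_comma = re.compile(r",(?![^(]*\))")


def merge_unbalanced(pieces):
    """Rejoin consecutive pieces until every '(' has its matching ')'."""
    merged = []
    buffer = ""
    depth = 0  # number of parentheses currently left open
    for piece in pieces:
        # Start a new item, or glue this piece onto the unfinished one.
        buffer = piece if depth == 0 else buffer + "," + piece
        depth += piece.count("(") - piece.count(")")
        if depth <= 0:
            merged.append(buffer)
            buffer = ""
            depth = 0
    if depth > 0:
        # Input ended with an unclosed '(': keep what was collected.
        merged.append(buffer)
    return merged


print(merge_unbalanced(rx_comma.split("a,b(c,d(e,f))")))
# ['a', 'b(c,d(e,f))']
```

Because the merge only counts parentheses, it gives the same result when fed the output of a plain "a,b(c,d(e,f))".split(","), so the regex is optional once this step is in place.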
Marcin answered Nov 14 '22 15:11