How to split a string by commas positioned outside of parentheses?

I have a string in the following format:

"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"

So basically it's a list of actors' names (each optionally followed by their role in parentheses). The role itself can contain a comma (the actor's name cannot, I strongly hope).

My goal is to split this string into a list of pairs - (actor name, actor role).

One obvious solution would be to go through each character, check for occurrences of '(', ')' and ',', and split whenever a comma occurs outside parentheses. But this seems a bit heavy...

I was thinking about splitting it using a regexp: first split the string by parentheses:

import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x)
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']

The odd elements here (1st, 3rd, ...) are actor names, the even ones are the roles. Then I could split the names by commas and somehow extract the name-role pairs. But this seems even worse than my first approach.

Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?

asked Oct 30 '09 08:10

kender

2 Answers

s = re.split(r',\s*(?=[^)]*(?:\(|$))', x)

The lookahead matches everything up to the next open-parenthesis or to the end of the string, iff there's no close-parenthesis in between. That ensures that the comma is not inside a set of parentheses.
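To get from that split to the (actor name, actor role) pairs asked for in the question, each piece can be parsed separately. This is only a minimal sketch, assuming roles never contain nested parentheses. The names `split_actors`, `SPLIT_RE` and `PAIR_RE` are hypothetical and not part of the answer above, and representing a missing role as `None` is a choice made here for illustration:

```python
import re

# Comma (plus trailing whitespace) followed by text that reaches "(" or the
# end of the string without passing a ")" first
SPLIT_RE = re.compile(r',\s*(?=[^)]*(?:\(|$))')
# A name, then an optional parenthesized role at the end of the piece
PAIR_RE = re.compile(r'\s*(?P<name>[^(]*?)\s*(?:\((?P<role>[^)]*)\))?\s*$')


def split_actors(text):
    """Return a list of (name, role) tuples; role is None when absent."""
    pairs = []
    for piece in SPLIT_RE.split(text):
        m = PAIR_RE.match(piece)
        if m is None:
            # Assumption: a malformed piece (e.g. unbalanced parens) is kept whole
            pairs.append((piece.strip(), None))
        else:
            pairs.append((m.group('name'), m.group('role')))
    return pairs


x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
print(split_actors(x))
# [('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'), ('Elvis Presley', None), ('Jane Doe', 'Jane Doe')]
```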

answered Oct 29 '22 10:10

Alan Moore

One way to do it is to use findall with a regex that greedily matches things that can go between separators, e.g.:

>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

The regex above matches one or more of:

- non-comma, non-open-paren characters
- strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren

One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
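A small illustration of that quirk, using a made-up input with two commas in a row. For contrast it also runs the lookahead-based `re.split` pattern from the other answer, which does keep the empty field. The variable names are only for this sketch:

```python
import re

r = re.compile(r'(?:[^,(]|\([^)]*\))+')
s = "Elvis Presley,,Jane Doe (Jane Doe)"

# findall: the empty field between the two commas is silently dropped
found = r.findall(s)
print(found)
# ['Elvis Presley', 'Jane Doe (Jane Doe)']

# re.split with the lookahead pattern: the empty field is preserved
split_result = re.split(r',\s*(?=[^)]*(?:\(|$))', s)
print(split_result)
# ['Elvis Presley', '', 'Jane Doe (Jane Doe)']
```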

Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:

"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
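Running the same findall regex on that string shows where it goes wrong: the first close paren ends the parenthesized group, so the comma after "James)" is treated as a separator. A quick check (variable names are just for this sketch):

```python
import re

r = re.compile(r'(?:[^,(]|\([^)]*\))+')
nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

fields = r.findall(nested)
for field in fields:
    print(repr(field))
# 'Wilbur Smith (son of John (Johnny, son of James)'
# ' aka Billy)'
# ' Eddie Murphy (John)'
```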

If you need to deal with nesting, your best bet would be to partition the string into parens, commas, and everything else (essentially tokenizing it; this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go. Keeping track of the nesting level is what regexes are incapable of doing on their own.
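A rough sketch of that tokenize-and-walk idea might look like the following. The names `TOKEN_RE` and `split_top_level` are hypothetical, and two behaviours are assumptions made here rather than anything stated above: fields are stripped of surrounding whitespace, and a stray close paren never pushes the depth below zero.

```python
import re

# Tokens: a single paren, a single comma, or a run of anything else
TOKEN_RE = re.compile(r'[(),]|[^(),]+')


def split_top_level(text):
    """Split text on commas that are not inside any parentheses."""
    fields = []
    current = []   # tokens of the field being assembled
    depth = 0      # current parenthesis nesting level
    for token in TOKEN_RE.findall(text):
        if token == '(':
            depth += 1
        elif token == ')':
            # Assumption: an unmatched ")" is kept as text, depth stays >= 0
            depth = max(depth - 1, 0)
        if token == ',' and depth == 0:
            # Top-level comma: close the current field
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(token)
    fields.append(''.join(current).strip())
    return fields


nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
print(split_top_level(nested))
# ['Wilbur Smith (son of John (Johnny, son of James), aka Billy)', 'Eddie Murphy (John)']
```

Unlike the findall approach, this keeps empty fields: two adjacent top-level commas produce an empty string between them.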

answered Oct 29 '22 10:10

Laurence Gonsalves
