 

Python splitting string by parentheses

A little while ago I asked a related question (Python splitting unknown string by spaces and parentheses). The answers worked great until my requirements changed. I still haven't grasped regex, so I need some help with this.

If the user types this:

new test (test1 test2 test3) test "test5 test6"

I would like the resulting variable to hold this list:

["new", "test", "test1 test2 test3", "test", "test5 test6"]

In other words, a single word separated by spaces should be split from the next word, while a group of words in parentheses should be kept together as one item with the parentheses removed. The same goes for quotation marks.

I am currently using this code (from the answers to the question linked above), which does not meet that requirement:

>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]', strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']

This works well, but there is a problem. If you have this:

strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"

it returns Hello and Test as a single item instead of two.

It also can't split on parentheses and quotation marks in the same string.
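To make the first problem concrete, here is the same one-liner run on that input. The variable name `parts` is only for illustration.

```python
import re

strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"

# re.split only cuts at the parentheses, so "Hello Test " stays one chunk;
# the ", ".join(...) then glues its two words into a single list item.
parts = [", ".join(x.split()) for x in re.split(r'[()]', strs) if x.strip()]
print(parts)
# ['Hello, Test', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
```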

TrevorPeyton asked Jun 27 '13 20:06



2 Answers

The answer was simply:

re.findall(r'\[[^\]]*\]|\([^)]*\)|"[^"]*"|\S+', strs)
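One caveat: with that pattern, bracketed, parenthesized and quoted matches come back with their delimiters still attached. A minimal sketch of one way to drop them is to wrap the inside of each alternative in a capture group and keep whichever group matched. The names `TOKEN_RE` and `split_groups` are hypothetical, not part of the original answer.

```python
import re

# Same alternatives as the pattern above, but each one captures only the
# text inside its delimiters (or the whole bare word).
TOKEN_RE = re.compile(r'\[([^\]]*)\]|\(([^)]*)\)|"([^"]*)"|(\S+)')

def split_groups(text):
    """Split text into words and bracketed/quoted groups, delimiters removed."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        # Exactly one capture group participates in each match.
        tokens.append(next(g for g in match.groups() if g is not None))
    return tokens

print(split_groups('new test (test1 test2 test3) test "test5 test6"'))
# ['new', 'test', 'test1 test2 test3', 'test', 'test5 test6']
```

Like the original pattern, this sketch does not handle nested parentheses, and an unclosed parenthesis or quote falls through to the `\S+` alternative.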
TrevorPeyton answered Oct 13 '22 00:10



This is pushing the limits of what regexps can do. Consider using pyparsing instead, which builds a recursive-descent parser. For this task, you could use:

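from pyparsing import Forward, Group, OneOrMore, Word, ZeroOrMore
import string, re

# Word characters: anything printable except parentheses, double quotes, and whitespace
RawWord = Word(re.sub(r'[()"\s]', '', string.printable))
Token = Forward()
# A token is a bare word, a quoted run of words, or a (possibly nested) parenthesized group
Token << ( RawWord | 
           Group('"' + OneOrMore(RawWord) + '"') |
           Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)

s = 'new test (test1 test2 test3) test "test5 test6"'  # example input from the question
Phrase.parseString(s, parseAll=True)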

This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
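For readers who can't install pyparsing, the nesting behaviour can be roughly approximated with the standard library alone. This is a sketch under stated assumptions, not a pyparsing equivalent: it uses a small regex lexer plus an explicit stack rather than recursive descent, and the names `LEX_RE` and `parse_nested` are made up for the example.

```python
import re

# Lexer: a double-quoted string, a single parenthesis, or a run of other
# non-space characters. An unmatched double quote is silently skipped.
LEX_RE = re.compile(r'"([^"]*)"|([()])|([^\s()"]+)')

def parse_nested(text):
    """Return a nested list: bare words and quoted text become strings,
    each parenthesized group becomes a sub-list."""
    stack = [[]]  # stack[-1] is the group currently being filled
    for match in LEX_RE.finditer(text):
        quoted, paren, word = match.groups()
        if quoted is not None:
            stack[-1].append(quoted)
        elif paren == "(":
            stack.append([])  # open a new nested group
        elif paren == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            stack[-1].append(group)  # attach finished group to its parent
        else:
            stack[-1].append(word)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

print(parse_nested('new test (test1 (a b) test3) test "test5 test6"'))
# ['new', 'test', ['test1', ['a', 'b'], 'test3'], 'test', 'test5 test6']
```

Unlike the pyparsing grammar, this keeps a quoted phrase as one string and does not report a stray quote as an error.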

I realize you've long since solved your problem, but this is one of the highest Google-ranked pages for problems like this, and pyparsing is a little-known library.

dspeyer answered Oct 12 '22 23:10
