Regex for nested parentheses in Python

Tags: python, regex, text

I have something like this:

Othername California (2000) (T) (S) (ok) {state (#2.1)}

Is there a regex that would produce:

Othername California ok 2.1

That is, I would like to keep the number inside the round parentheses that are themselves inside {}, and keep the text "ok" that appears inside (). I specifically need the string "ok" to be printed whenever it is present in a line, but I want to discard any other parenthesised text, e.g. (V), (S) or (2002).

I am aware that regex is probably not the most efficient way to handle such a problem.

Any help would be appreciated.

EDIT:

The strings vary: when a piece of information is unavailable, it is simply left out of the line. The text itself also changes (e.g. not every line contains "state"). So one can have, for example:

Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}
Name1 Name2 (2002) {edu (#1.1)}
Name1 Name2 Name3 (2000) (V) {variation (#4.12)}

user2447387 asked Jun 18 '13 08:06


2 Answers

Regex

(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}
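
To make the pattern easier to read, here is a sketch of the same regex written in verbose mode with one comment per piece. The name `PATTERN` is only illustrative, and the `#` has to be escaped because verbose mode treats an unescaped `#` as the start of a comment.

```python
import re

# The same pattern as above, spelled out with re.VERBOSE.
PATTERN = re.compile(r"""
    (.+)                 # group 1: the name part (greedy, backtracks to the year)
    \s+\(\d+\)           # whitespace, then an all-digit group such as (2000)
    .+?                  # lazily skip whatever follows, e.g. " (T) (S) "
    (?:                  # optional block:
        \(([^)]{2,})\)   #   group 2: parenthesised text of 2+ characters, e.g. (ok)
        \s+(?={)         #   followed by whitespace and then an opening brace
    )?
    \{.+                 # opening brace and the label inside it
    \(\#(\d+\.\d+)\)     # group 3: the number, as in (#2.1)
    \}                   # closing brace
""", re.VERBOSE)

m = PATTERN.search("Othername California (2000) (T) (S) (ok) {state (#2.1)}")
print(m.groups())  # ('Othername California', 'ok', '2.1')
```

Note that group 2 accepts any parenthesised text of two or more characters sitting directly before the `{`, not only the literal "ok".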

Text used for test

Name1 Name2 Name3 (2000) {Education (#3.2)}
Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}
Name1 Name2 (2002) {edu (#1.1)}
Name1 Name2 Name3 (2000) (V) {variation (#4.12)}
Othername California (2000) (T) (S) (ok) {state (#2.1)}

Test

>>> import re
>>> # `string` holds the test text above, one entry per line
>>> regex = re.compile(r"(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0x54e2105f36c16a48>
>>> regex.match(string)
<_sre.SRE_Match object at 0x54e2105f36c169e8>

# Run findall
>>> regex.findall(string)
[
   (u'Name1 Name2 Name3'   , u''  , u'3.2'),
   (u'Name1 Name2 Name3'   , u'ok', u'1.1'),
   (u'Name1 Name2'         , u''  , u'1.1'),
   (u'Name1 Name2 Name3'   , u''  , u'4.12'),
   (u'Othername California', u'ok', u'2.1')
]
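
To get from these tuples to the single-line form asked for in the question, the captured groups can be joined with spaces. The following is a minimal Python 3 sketch that assumes one record per line; the helper name `summarize` is made up for illustration.

```python
import re

# Same pattern as in the answer, written as a raw string.
REGEX = re.compile(
    r"(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}"
)

def summarize(line):
    """Return 'name [text] number' for one line, or None if it does not match."""
    m = REGEX.search(line)
    if m is None:
        return None
    # An unmatched optional group is None; drop it before joining.
    return " ".join(part for part in m.groups() if part)

lines = [
    "Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}",
    "Name1 Name2 (2002) {edu (#1.1)}",
    "Othername California (2000) (T) (S) (ok) {state (#2.1)}",
]
for line in lines:
    print(summarize(line))
# Name1 Name2 Name3 ok 1.1
# Name1 Name2 1.1
# Othername California ok 2.1
```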

Stephan answered Sep 28 '22 22:09


Try this one:

import re

thestr = 'Othername California (2000) (T) (S) (ok) {state (#2.1)}'

regex = r'''
    ([^(]*)             # match anything but a (
    \                   # a space
    (?:                 # non capturing parentheses
        \([^(]*\)       # parentheses
        \               # a space
    ){3}                # three times
    \(([^(]*)\)         # capture fourth parentheses contents
    \                   # a space
    {                   # opening {
        [^}]*           # anything but }
        \(\#            # opening ( followed by #
            ([^)]*)     # match anything but )
        \)              # closing )
    }                   # closing }
'''

match = re.match(regex, thestr, re.X)

print(match.groups())

Output:

('Othername California', 'ok', '2.1')

And here's the compressed version:

import re

thestr = 'Othername California (2000) (T) (S) (ok) {state (#2.1)}'
regex = r'([^(]*) (?:\([^(]*\) ){3}\(([^(]*)\) {[^}]*\(\#([^)]*)\)}'
match = re.match(regex, thestr)

print(match.groups())
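
The pattern above requires exactly three parenthesised groups before the one it captures, so it does not match the shorter lines from the question's edit. A possible variation in the same verbose style, offered as a sketch and not as part of the original answer, accepts any number of groups to discard and keeps only a literal (ok). The names `PATTERN` and `extract` and the negative lookahead are assumptions of this sketch.

```python
import re

PATTERN = re.compile(r"""
    ([^(]*?)                # group 1: name, everything before the first (
    \s*
    (?:                     # zero or more parenthesised groups to discard
        \((?!ok\))[^()]*\)  #   any (...) except a literal (ok)
        \s*
    )*
    (?:\((ok)\)\s*)?        # group 2: the literal (ok), if present
    \{                      # opening {
        [^}]*               # anything but }
        \(\#([^)]*)\)       # group 3: contents of (#...)
    \}                      # closing }
""", re.X)

def extract(line):
    """Return (name, 'ok' or None, number), or None when the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

print(extract("Othername California (2000) (T) (S) (ok) {state (#2.1)}"))
# ('Othername California', 'ok', '2.1')
print(extract("Name1 Name2 (2002) {edu (#1.1)}"))
# ('Name1 Name2', None, '1.1')
```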

gitaarik answered Sep 28 '22 23:09