Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using regex to extract information from a string

Tags:

python

regex

This is a follow-up and complication to this question: Extracting contents of a string within parentheses.

In that question I had the following string --

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

And I wanted to get a list of tuples in the form of (actor, character) --

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]

To generalize matters, I have a slightly more complicated string, and I need to extract the same information. The string I have is --

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)"

I need to format this as follows:

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Stephen Root',''), ('Lauren Dern', 'Delilah')]

I know I can replace the filler words (with, and, &, etc.), but can't quite figure out how to add a blank entry -- '' -- if there is no character name for the actor (in this case Stephen Root). What would be the best way to go about doing this?

Finally, I need to take into account if an actor has multiple roles, and build a tuple for each role the actor has. The final string I have is:

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"

And I need to build a list of tuples as follows:

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),    
 ('Glenn Howerton', 'Brad'), ('Stephen Root',''), ('Lauren Dern', 'Delilah'), ('Lauren Dern', 'Stacy')]

Thank you.

like image 652
David542 Avatar asked Aug 10 '11 12:08

David542


1 Answers

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
 ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
 ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
like image 82
Tim Pietzcker Avatar answered Sep 29 '22 03:09

Tim Pietzcker