Match names, dialogues, and actions from transcript using regex

Question

Given a dialogue string such as the one below, I need to find the sentence that corresponds to each user.

text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

For the above dialogue, I would like to return three-element tuples containing:

The name of the person
The sentence in lower case and
The sentences within brackets

Something like this:

('CHRIS', 'Hello, how are you...', None)

('PETER', 'Great, you?', None)

('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')

('CHRIS', 'Are you ok?', None)

etc...

I am trying to use regex to achieve the above. So far I have been able to get the names of the users with the code below, but I am struggling to identify the sentence between two users.

actors = re.findall(r'\w+(?=\s*:[^/])',text)

cs95 · Accepted Answer

You can do this with re.findall:

>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
 ('PETER', ' Great, you? ', ''),
 ('PAM',
  ' He is resting.',
  '[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
 ('CHRIS', ' Are you ok?', '')]

You will have to remove the square brackets yourself in a separate step; a single regex cannot strip them while still matching everything else.
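One possible clean-up is a second pass over the findall result. This is a minimal sketch, assuming the pattern is written with its bracket escapes as \[ and \]; clean is a hypothetical helper name, not part of the original answer.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

pattern = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'

def clean(match):
    """Hypothetical helper: tidy one (name, sentence, actions) tuple."""
    name, sentence, actions = match
    # Extract the text inside each [...] block; an empty group yields no items.
    items = re.findall(r'\[([^\]]*)\]', actions)
    # Join the actions with '. ' and fall back to None when there are none.
    return (name, sentence.strip(), '. '.join(items) or None)

result = [clean(m) for m in re.findall(pattern, text)]
print(result)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD'),
# ('CHRIS', 'Are you ok?', None)]
```

Keeping the clean-up separate leaves the matching pattern unchanged and makes the None convention easy to adjust.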

Regex Breakdown

\b              # Word boundary
(\S+)           # First capture group, string of characters not having a space
:               # Colon
(               # Second capture group
    [^          # Match anything that is not...
        :       #     a colon
        \[\]    #     or square brackets
    ]+?         # Non-greedy match
)
\n?             # Optional newline
(               # Third capture group
    \[          # Literal opening bracket
    [^:]+?      # Similar to above - exclude colon from match
    \]          # Literal closing bracket
    \n?         # Optional newlines
)?              # Third capture group is optional
(?=             # Lookahead for... 
    \b          #     a word boundary, followed by  
    \S+         #     one or more non-space chars, and
    :           #     a colon
    |           # Or,
    $           # End of string (no re.MULTILINE flag is used)
)
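The breakdown can also be kept in the code itself by compiling the pattern with re.VERBOSE, which ignores whitespace and # comments outside character classes. This is a sketch of that idea, not something the original answer does; the character classes stay on one line because whitespace inside a class is still significant in verbose mode.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

pattern = re.compile(r"""
    \b                  # word boundary
    (\S+)               # group 1: speaker name (run of non-space characters)
    :                   # colon after the name
    ([^:\[\]]+?)        # group 2: spoken text, no colon or square brackets, lazy
    \n?                 # optional newline
    (\[[^:]+?\]\n?)?    # group 3 (optional): bracketed action lines
    (?=\b\S+:|$)        # lookahead: the next NAME: or the end of the string
""", re.VERBOSE)

matches = pattern.findall(text)
print(matches)
```

The result should be the same list of tuples as the one-line pattern produces.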

pault · Answer

Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.

For example, we could first find groups of names and text:

from itertools import groupby

def isName(word):
    # Names end with ':'
    return word.endswith(":")

text_split = [
    " ".join(list(g)).rstrip(":") 
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']

Next you can collect pairs of consecutive elements in text_split into tuples:

print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
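An alternative way to form the same pairs is to zip the even-indexed and odd-indexed slices. This is a minimal sketch with text_split hard-coded to the value printed earlier; like the index-based version, it assumes the list strictly alternates between a name and its text.

```python
text_split = [
    'CHRIS', 'Hello, how are you...',
    'PETER', 'Great, you?',
    'PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
    'CHRIS', 'Are you ok?',
]

# Names sit at even indices and their text at the following odd indices.
pairs = list(zip(text_split[::2], text_split[1::2]))
print(pairs)
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
```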

We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions are admittedly an option here, but I'm purposely avoiding them in this answer.)

Here's something quick that I came up with:

def isClosingBracket(word):
    return word.endswith("]")

def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".") 
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]

print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]

Note that using * to unpack the result of processWords inside a tuple display requires Python 3.5 or later (PEP 448).
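On interpreters that lack this syntax, plain tuple concatenation builds the same row. A minimal sketch, with one hard-coded row standing in for the output of processWords:

```python
name = "PAM"
# Stand-in for processWords(...) applied to PAM's text.
parts = ["He is resting.", "PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD."]

# Equivalent to (name, *parts), without the unpacking syntax.
row = (name,) + tuple(parts)
print(row)
#('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.')
```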

Tags: python, string, regex

pbou
