Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match names, dialogues, and actions from transcript using regex

Given a string dialogue such as below, I need to find the sentence that corresponds to each user.

text = 'CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'

For the above dialogue, I would like to return tuples with three elements with:

  1. The name of the person

  2. The sentence in lower case and

  3. The sentences within Brackets

Something like this:

('CHRIS', 'Hello, how are you...', None)

('PETER', 'Great, you?', None)

('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')

('CHRIS', 'Are you ok?', None)

etc...

I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.

actors = re.findall(r'\w+(?=\s*:[^/])',text)
like image 253
pbou Avatar asked Dec 24 '18 15:12

pbou


2 Answers

You can do this with re.findall:

>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
 ('PETER', ' Great, you? ', ''),
 ('PAM',
  ' He is resting.',
  '[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
 ('CHRIS', ' Are you ok?', '')]

You will have to figure out how to remove the square braces yourself, that cannot be done with regex while still attempting to match everything.

Regex Breakdown

\b              # Word boundary
(\S+)           # First capture group, string of characters not having a space
:               # Colon
(               # Second capture group
    [^          # Match anything that is not...
        :       #     a colon
        \[\]    #     or square braces
    ]+?         # Non-greedy match
)
\n?             # Optional newline
(               # Third capture group
    \[          # Literal opening brace
    [^:]+?      # Similar to above - exclude colon from match
    \] 
    \n?         # Optional newlines
)?              # Third capture group is optional
(?=             # Lookahead for... 
    \b          #     a word boundary, followed by  
    \S+         #     one or more non-space chars, and
    :           #     a colon
    |           # Or,
    $           # EOL
)
like image 171
cs95 Avatar answered Oct 07 '22 21:10

cs95


Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.

For example, we could first find groups of names and text:

from itertools import groupby

def isName(word):
    # Names end with ':'
    return word.endswith(":")

text_split = [
    " ".join(list(g)).rstrip(":") 
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']

Next you can collect pairs of consecutive elements in text_split into tuples:

print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]

We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions is admittedly an option here, but I'm purposely avoiding that in this answer.)

Here's something quick that I came up with:

def isClosingBracket(word):
    return word.endswith("]")

def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".") 
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]

print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]

Note that using the * to unpack the result of processWords into the tuple is strictly a python 3 feature.

like image 34
pault Avatar answered Oct 07 '22 23:10

pault