Given a string dialogue such as below, I need to find the sentence that corresponds to each user.
text = 'CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'
For the above dialogue, I would like to return tuples with three elements with:
The name of the person
The sentence in lower case and
The sentences within Brackets
Something like this:
('CHRIS', 'Hello, how are you...', None)
('PETER', 'Great, you?', None)
('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')
('CHRIS', 'Are you ok?', None)
etc...
I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.
actors = re.findall(r'\w+(?=\s*:[^/])',text)
You can do this with re.findall
:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
You will have to figure out how to remove the square braces yourself, that cannot be done with regex while still attempting to match everything.
Regex Breakdown
\b # Word boundary
(\S+) # First capture group, string of characters not having a space
: # Colon
( # Second capture group
[^ # Match anything that is not...
: # a colon
\[\] # or square braces
]+? # Non-greedy match
)
\n? # Optional newline
( # Third capture group
\[ # Literal opening brace
[^:]+? # Similar to above - exclude colon from match
\]
\n? # Optional newlines
)? # Third capture group is optional
(?= # Lookahead for...
\b # a word boundary, followed by
\S+ # one or more non-space chars, and
: # a colon
| # Or,
$ # EOL
)
Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.
For example, we could first find groups of names and text:
from itertools import groupby
def isName(word):
# Names end with ':'
return word.endswith(":")
text_split = [
" ".join(list(g)).rstrip(":")
for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
Next you can collect pairs of consecutive elements in text_split
into tuples:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions is admittedly an option here, but I'm purposely avoiding that in this answer.)
Here's something quick that I came up with:
def isClosingBracket(word):
return word.endswith("]")
def processWords(words):
if "[" not in words:
return [words, None]
else:
return [
" ".join(g).replace("]", ".")
for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
]
print(
[(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
Note that using the *
to unpack the result of processWords
into the tuple
is strictly a python 3 feature.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With