I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:
"said Polly Pocket and the toys" -> Polly Pocket
Here is the regex I am using:
re.findall('said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)
It returns the following:
[('Polly Pocket', ' Pocket')]
I want it to return:
['Polly Pocket']
Use a positive look-ahead:
([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
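The look-ahead asserts that the first word, to be accepted, is followed by whitespace and another word starting with a capital letter. The group that follows it enforces the same condition, so the look-ahead mainly makes the intent explicit. What changes the output of re.findall is that the repeated group is non-capturing, (?:...): with only one capturing group left, findall returns a list of strings instead of a list of tuples. Broken down: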
( # begin capture
[A-Z] # one uppercase letter \ First Word
[a-z]+ # 1+ lowercase letters /
(?=\s[A-Z]) # must be followed by a whitespace character and an uppercase letter
(?: # non-capturing group
\s # one whitespace character
[A-Z] # one uppercase letter \ Additional Word(s)
[a-z]+ # 1+ lowercase letters /
)+ # group can be repeated (more words)
) # end capture
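To check the behaviour against the example from the question, here is a small sketch. The names ORIGINAL, FIXED, MINIMAL and find_names are made up for illustration, and the "said " prefix is carried over from the question's pattern; the pattern in this answer does not include it.

```python
import re

# Pattern from the question: the inner (...) is a second capturing group,
# so re.findall returns one tuple per match.
ORIGINAL = r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)'

# Pattern from this answer, with the question's "said " prefix added.
# The repeated group is non-capturing, so only one capturing group remains.
FIXED = r'said ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)'

# Assumption: the smallest change to the question's pattern,
# turning the inner group into (?:...) and nothing else.
MINIMAL = r'said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)'


def find_names(pattern, text):
    """Return whatever re.findall yields for the given pattern and text."""
    return re.findall(pattern, text)


article = "said Polly Pocket and the toys"

print(find_names(ORIGINAL, article))  # [('Polly Pocket', ' Pocket')]
print(find_names(FIXED, article))     # ['Polly Pocket']
print(find_names(MINIMAL, article))   # ['Polly Pocket']
```

Note that [a-z]+ is stricter than the question's [\w-]*: it will not match hyphenated names or names with an inner capital in full. If those matter, keeping [\w-]* and only making the inner group non-capturing, as in MINIMAL, may be the better fit.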