
Get consecutive capitalized words using regex

Tags:

python

regex

I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:

"said Polly Pocket and the toys" -> Polly Pocket

Here is the regex I am using:

re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

It returns the following:

[('Polly Pocket', ' Pocket')]

I want it to return:

['Polly Pocket']

egidra asked Mar 01 '12 23:03


1 Answer

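Make the repeated group non-capturing with (?:...), so that re.findall returns only the one captured string instead of a tuple, and use a positive look-ahead: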

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)

The look-ahead asserts that the first word is accepted only if it is followed by whitespace and another word starting with a capital letter. Broken down:

(                    # begin capture
  [A-Z]              # one uppercase letter   \ First Word
  [a-z]+             # 1+ lowercase letters   /
  (?=\s[A-Z])        # must be followed by whitespace and an uppercase letter
  (?:                # non-capturing group
    \s               # one whitespace character
    [A-Z]            # one uppercase letter   \ Additional Word(s)
    [a-z]+           # 1+ lowercase letters   /
  )+                 # group can be repeated (more words)
)                    # end capture
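
To check this against the sentence from the question, here is a minimal sketch. The variable names (result_answer, result_original, result_verbose) are only for illustration. The second pattern is the question's own regex with just the inner group changed to non-capturing, and the third is the same breakdown compiled with re.VERBOSE so the comments can stay inside the pattern.

```python
import re

article = "said Polly Pocket and the toys"

# Pattern from this answer: one capturing group, inner repetition non-capturing.
answer_pattern = r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)"
result_answer = re.findall(answer_pattern, article)

# The question's pattern, with only the inner group made non-capturing.
# With a single capturing group, findall returns plain strings, not tuples.
original_pattern = r"said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)"
result_original = re.findall(original_pattern, article)

# Same pattern as above, written out with re.VERBOSE
# (whitespace and # comments inside the pattern are ignored).
verbose_pattern = re.compile(
    r"""
    (                  # begin capture
      [A-Z][a-z]+      # first word
      (?=\s[A-Z])      # must be followed by whitespace + uppercase letter
      (?:              # non-capturing group
        \s[A-Z][a-z]+  # additional word
      )+               # one or more additional words
    )                  # end capture
    """,
    re.VERBOSE,
)
result_verbose = verbose_pattern.findall(article)

print(result_answer)    # ['Polly Pocket']
print(result_original)  # ['Polly Pocket']
print(result_verbose)   # ['Polly Pocket']
```

Note that [a-z]+ only accepts words like "Polly"; if names such as "McDonald" or "Jean-Luc" need to match, the question's [\w-]* is probably the better choice for the word body.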

Brad Christie answered Sep 21 '22 09:09