Regex find Proper Nouns or Phrases that are NOT first word in a sentence

Tags: regex, vb.net

I've found several questions that touch on this, but none that seem to answer it. I am trying to build a regex that will let me identify proper nouns in a block of text.

I am defining a proper noun as follows: a word or group of words that begins with a capital letter, is longer than 1 character (to exclude things like I, A, etc.), and is NOT the first word of a new sentence.

So, in the following text

"Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham"

I would want the following returned

Holiday Inn Thursday Tom Shirley Temple Green Eggs Ham

Right now, [A-Z]{1,1}[a-z]*([\s][A-Z]{1,1}[a-z]*)* is what I have, but it's returning Susan Dow and She in addition to the ones listed above. How can I get a look-behind on the period (.) to work, so that words at the start of a sentence are excluded?
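For anyone who wants to reproduce the behavior, here is a minimal sketch in Python (chosen only for illustration; the question itself is tagged vb.net, and the function name is hypothetical). It runs the pattern above unchanged against the sample text and shows that the sentence-initial matches come back too.

```python
import re

TEXT = ("Susan Dow stayed at the Holiday Inn on Thursday. "
        "She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham")

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"[A-Z]{1,1}[a-z]*([\s][A-Z]{1,1}[a-z]*)*")

def find_with_original_pattern(text):
    # finditer + group(0) returns whole matches; findall would return only
    # the contents of the capturing group.
    return [m.group(0) for m in ORIGINAL.finditer(text)]

print(find_with_original_pattern(TEXT))
# ['Susan Dow', 'Holiday Inn', 'Thursday', 'She', 'Tom',
#  'Shirley Temple', 'Green Eggs', 'Ham']
```

Nothing in the pattern looks at what precedes a match, which is why Susan Dow and She are included.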

user2938667 asked Oct 20 '22 21:10


1 Answer

You can use:

(?<!^|\. |\.  )[A-Z][a-z]+

per this rubular

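Update: Integrated the two negative lookbehinds into one using alternation. Also added a check for two spaces between sentences. Note that in many regex flavors repetition operators cannot be used inside lookbehinds, per the notes at http://www.regular-expressions.info/lookaround.html, which is why the one-space and two-space cases are listed as separate alternatives instead of using a quantifier.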
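As an illustration only, here is a rough Python port of the pattern above. Python is an assumption made for this example, not the asker's language. Python's re module only accepts fixed-width lookbehinds, so the alternation is split into three separate negative lookbehinds. The second function is not part of the answer: it is a hypothetical sketch that matches whole capitalized phrases first and then drops any phrase that starts a sentence. All function and constant names are made up for the example.

```python
import re

TEXT = ("Susan Dow stayed at the Holiday Inn on Thursday. "
        "She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham")

# Port of (?<!^|\. |\.  )[A-Z][a-z]+ : each alternative becomes its own
# fixed-width negative lookbehind, because Python rejects alternatives of
# different lengths inside a single lookbehind.
WORD = re.compile(r"(?<!^)(?<!\. )(?<!\.  )[A-Z][a-z]+")

def find_capitalized_words(text):
    """Capitalized words that are not directly at a sentence start."""
    return WORD.findall(text)

# Hypothetical alternative: match runs of capitalized words, then filter.
PHRASE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*")
# True when the text before a match is empty or ends in a period plus whitespace.
SENTENCE_START = re.compile(r"(?:^|\.\s+)$")

def find_phrases_not_at_sentence_start(text):
    """Capitalized phrases, excluding any phrase that begins a sentence."""
    return [m.group(0) for m in PHRASE.finditer(text)
            if not SENTENCE_START.search(text[:m.start()])]

print(find_capitalized_words(TEXT))
# ['Dow', 'Holiday', 'Inn', 'Thursday', 'Tom', 'Shirley', 'Temple',
#  'Green', 'Eggs', 'Ham']
print(find_phrases_not_at_sentence_start(TEXT))
# ['Holiday Inn', 'Thursday', 'Tom', 'Shirley Temple', 'Green Eggs', 'Ham']
```

The lookbehind version works word by word, so Dow is still returned: it is not the first word of its sentence, even though the phrase Susan Dow is. Filtering whole phrases after matching avoids that, at the cost of a second regex check per match.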

Peter Alfvin answered Oct 24 '22 01:10