
Designing a Regex to find any Noun Phrase

I'm trying to build a chunker (or shallow parser) using regular expressions (and without NLTK), but I can't come up with a regular expression that does what I want. Here's my immediate goal: find all noun phrases in a natural language text.

My first step is to tag all sentences with my home-brewed part of speech tagger, and then to join the list of token/tag pairs into a single string like so:

'he PRN and CC bill NP could MOD hear VB them PRN on IN the DT large JJ balcony NN near IN the DT house NN'
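
A minimal sketch of that joining step, assuming the tagger returns a list of (token, tag) tuples. The names `join_tagged` and `tagged` are hypothetical and only illustrate the idea:

```python
def join_tagged(pairs):
    """Flatten (token, tag) pairs into one 'word TAG word TAG ...' string."""
    return ' '.join('{} {}'.format(token, tag) for token, tag in pairs)

# Hypothetical tagger output for the first three tokens of the sentence.
tagged = [('he', 'PRN'), ('and', 'CC'), ('bill', 'NP')]

text_sample = join_tagged(tagged)
print(text_sample)  # he PRN and CC bill NP
```

Every word and tag ends up separated by exactly one space, which matters for any regex that later treats spaces as literal characters.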

My next step is to use a regular expression to search the string for instances of noun phrases. Now the general linguistic formula for a noun phrase is: an optional determiner (DT), zero or more adjectives (JJ), and a noun (NN), proper noun (NP), or pronoun (PRN). Given this general formula, I tried this regular expression (keep in mind the tagged string alternates between words and tags):

'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))'

Here's my code:

import re

text = 'he PRN and CC bill NP could MOD hear VB them PRN on IN the DT large JJ balcony NN near IN the DT house NN'

regex = re.compile(r'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))')
m = regex.findall(text)

if m:
    print(m)

And here's my output:

[('the DT', 'large JJ', 'balcony NN', 'NN')]

It's not finding pronouns or proper nouns, and for some reason it only matches the NN in a '\w+ DT \w+ NN' pattern. I assumed my regex would match these patterns, since I made the determiner pattern optional (?) and allowed the adjective pattern to repeat zero or more times (*).

Chris

asked Jun 24 '14 01:06 by user3609038



1 Answer

Use this:

(?:(?:\w+ DT )?(?:\w+ JJ )*)?\w+ (?:N[NP]|PRN)


  • (?:(?:\w+ DT )?(?:\w+ JJ )*)? optionally matches the DT, followed by zero or more adjectives
  • \w+ (?:N[NP]|PRN) matches the NN, NP or PRN
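
The reason the original pattern misses most phrases is that its spaces sit between the groups, so they stay mandatory even when the optional groups match nothing: a bare noun would need two leading spaces. Moving each space inside its optional group removes that requirement. A short sketch comparing the two patterns on the question's string (the variable names are illustrative only):

```python
import re

text = ('he PRN and CC bill NP could MOD hear VB them PRN on IN '
        'the DT large JJ balcony NN near IN the DT house NN')

# Question's pattern: the literal spaces between groups are always required.
original = re.compile(r'(\w+ DT)? (\w+ JJ)* (\w+ (NN|NP|PRN))')

# Answer's pattern: each space lives inside the group it belongs to.
fixed = re.compile(r'(?:(?:\w+ DT )?(?:\w+ JJ )*)?\w+ (?:N[NP]|PRN)')

print(original.findall(text))
# [('the DT', 'large JJ', 'balcony NN', 'NN')]

noun_phrases = fixed.findall(text)
print(noun_phrases)
# ['he PRN', 'bill NP', 'them PRN', 'the DT large JJ balcony NN', 'the DT house NN']
```

Because the answer's pattern uses only non-capturing groups, findall returns each whole match as a string rather than a tuple of groups.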
answered Oct 19 '22 19:10 by zx81
