I need to extract all English verbs from a given text, and I was wondering how I could do it. At first glance, my idea is to use regular expressions, because all English verb tenses follow patterns, but maybe there is another way to do it. What I've thought of is simply:

  1. Create a pattern for every verb tense. I have to distinguish between regular verbs (http://en.wikipedia.org/wiki/English_verbs) and irregular verbs (http://www.chompchomp.com/rules/irregularrules01.htm) in some way.
  2. Iterate over these patterns and split the text using them. The last word of each substring is supposed to be the verb that gives complete meaning to the sentence, which I need for another purpose: nominalization.

What do you think? I guess this isn't an efficient way to do it, but I can't think of another one.

Thank you in advance!


  1. I have two dictionaries, one for all English verbs and the other for all English nouns.
  2. The main problem is that the project is about verb nominalization (it's just a uni project), so all the "effort" is supposed to be focused on that part. Specifically, I follow this model: acl.ldc.upenn.edu/P/P00/P00-1037.pdf. The project consists of, given a text, finding all the verbs in that text and proposing multiple nominalizations for each verb. So the first step (finding verbs) should be as simple as possible... but I can't use any parser; it's not allowed.

Part of Speech tagger

Identifying and then extracting all the verbs within a text is very easy using a Part-of-Speech (POS) tagger. Such taggers label all of the words in a text with part-of-speech tags that indicate whether they are verbs, nouns, adjectives, adverbs, etc. Modern POS taggers are very accurate. For example, Toutanova et al. (2003) report that Stanford's open source POS tagger assigns the correct tag 97.24% of the time on newswire data.

Performing POS tagging

Java: If you're using Java, a good package for POS tagging is the Stanford Log-linear Part-Of-Speech Tagger. Matthew Jockers put together a great tutorial on using this tagger.

Python: If you prefer Python, you can use the POS tagger included in the Natural Language Toolkit (NLTK). A code snippet demonstrating how to perform POS tagging with this package is given below:

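import nltk

# Both calls below need NLTK's tokenizer and tagger data, fetched once with nltk.download().
text = "I am very happy to be here today"
tokens = nltk.word_tokenize(text)         # split the sentence into word tokens
pos_tagged_tokens = nltk.pos_tag(tokens)  # label each token with a POS tag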

The resulting POS-tagged tokens will be a list of tuples, where the first entry in each tuple is the tagged word and the second entry is the word's POS tag. For the code snippet above, pos_tagged_tokens will be set to:

[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'), ('to', 'TO'), 
 ('be', 'VB'), ('here', 'RB'), ('today', 'NN')]

Understanding the Tag Set

Both the Stanford POS tagger and NLTK use the Penn Treebank tag set. If you're just interested in extracting the verbs, pull out all words that have a POS tag that starts with a "V" (e.g., VB, VBD, VBG, VBN, VBP, and VBZ).
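
As a minimal sketch of that filtering step, assuming the tagged tokens are (word, tag) tuples like the ones shown earlier: the helper name extract_verbs and its include_modals flag are hypothetical, not part of NLTK. Note that matching on the "VB" prefix skips modal verbs such as "can", which the Penn Treebank tag set labels MD, so the flag lets you opt in to those as well.

```python
def extract_verbs(tagged_tokens, include_modals=False):
    """Return the words whose Penn Treebank tag marks them as verbs.

    tagged_tokens is a list of (word, tag) tuples.
    extract_verbs is a hypothetical helper name, not an NLTK function.
    """
    verbs = []
    for word, tag in tagged_tokens:
        # VB, VBD, VBG, VBN, VBP and VBZ all share the "VB" prefix;
        # modal verbs carry the separate MD tag.
        if tag.startswith("VB") or (include_modals and tag == "MD"):
            verbs.append(word)
    return verbs


# Tagged output copied from the example above, so no tagger is needed here.
pos_tagged_tokens = [
    ("I", "PRP"), ("am", "VBP"), ("very", "RB"), ("happy", "JJ"),
    ("to", "TO"), ("be", "VB"), ("here", "RB"), ("today", "NN"),
]

print(extract_verbs(pos_tagged_tokens))  # ['am', 'be']
```

In a real pipeline you would pass the result of nltk.pos_tag(tokens) straight into the helper instead of the hard-coded list.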

Parsing natural language with regex is impossible. Forget it.

As a drastic example: How would you find the verbs (marked with asterisks) in this sentence?

Buffalo buffalo Buffalo buffalo buffalo* buffalo* Buffalo buffalo

While you'll hardly come across extreme cases like this, there are dozens of verbs that could also be nouns, adjectives, etc. if you just look at the word.
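
To make that concrete, here is a small sketch of a plain word-list lookup. The two word sets are tiny hypothetical stand-ins for real verb and noun dictionaries, and the function name ambiguous_words is made up for illustration. Any word that shows up in both lists cannot be classified by looking at the word alone:

```python
import re

# Hypothetical miniature dictionaries; real ones would hold thousands of entries.
VERBS = {"run", "book", "watch", "buffalo", "eat"}
NOUNS = {"run", "book", "watch", "buffalo", "table"}


def ambiguous_words(text):
    """Return words found in both word lists, in order of first appearance."""
    found = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in VERBS and word in NOUNS and word not in found:
            found.append(word)
    return found


print(ambiguous_words("I book a table and watch them eat"))  # ['book', 'watch']
```

Here "book" and "watch" happen to be verbs, but the lookup gives the same answer for "a book and a watch", where both are nouns. Only the surrounding context can tell the two readings apart.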

You need a natural language parser like Stanford NLP. I have never used one, so I don't know how good your results are going to be, but they will be better than with regex, I can tell you that.

