I need to extract all English verbs from a given text, and I was wondering how I could do it. At first glance, my idea is to use regular expressions, because all English verb tenses follow patterns, but maybe there is another way to do it. What I've thought is simply:
What do you think? I guess this isn't an efficient way to do it but I can't imagine another one.
Thank you in advance!
PS:
Part of Speech tagger
Identifying and then extracting all the verbs within a text is very easy using a Part-of-Speech (POS) tagger. Such taggers label every word in a text with a part-of-speech tag that indicates whether it is a verb, noun, adjective, adverb, etc. Modern POS taggers are very accurate. For example, Toutanova et al. (2003) report that Stanford's open source POS tagger assigns the correct tag 97.24% of the time on newswire data.
Performing POS tagging
Java: If you're using Java, a good package for POS tagging is the Stanford Log-linear Part-Of-Speech Tagger. Matthew Jockers has put together a great tutorial on using this tagger.
Python: If you prefer Python, you can make use of the POS tagger included in the Natural Language Toolkit (nltk). A code snippet demonstrating how to perform POS tagging using this package is given below:
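import nltk

# One-time setup: the tokenizer and tagger models must first be fetched with nltk.download().
text = "I am very happy to be here today"
tokens = nltk.word_tokenize(text)          # split the text into word tokens
pos_tagged_tokens = nltk.pos_tag(tokens)   # label each token with a Penn Treebank tag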
The resulting POS-tagged tokens are a list of tuples, where the first entry in each tuple is the tagged word and the second entry is the word's POS tag. For the code snippet above, pos_tagged_tokens will be set to:
[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'), ('to', 'TO'),
('be', 'VB'), ('here', 'RB'), ('today', 'NN')]
Understanding the Tag Set
Both the Stanford POS tagger and NLTK use the Penn Treebank tag set. If you're just interested in extracting the verbs, pull out all words whose POS tag starts with a "V" (i.e., VB, VBD, VBG, VBN, VBP, and VBZ). Note that modal verbs such as "can" and "will" are tagged MD, so include that tag as well if you want them.
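As a rough sketch of that filtering step, the function below works on a list of (word, tag) tuples like the one shown above. The name extract_verbs and the include_modals flag are my own, made up for illustration; they are not part of NLTK. Including the MD tag for modals is an optional assumption about what you count as a verb.

```python
def extract_verbs(tagged_tokens, include_modals=False):
    """Return the words whose Penn Treebank tag marks them as verbs.

    tagged_tokens: list of (word, tag) tuples, e.g. the output of a POS tagger.
    include_modals: hypothetical option; if True, words tagged MD are kept too.
    """
    verbs = []
    for word, tag in tagged_tokens:
        # VB, VBD, VBG, VBN, VBP and VBZ all start with "VB"
        if tag.startswith("VB") or (include_modals and tag == "MD"):
            verbs.append(word)
    return verbs


# The tagged sentence from the example above
tagged = [('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('happy', 'JJ'),
          ('to', 'TO'), ('be', 'VB'), ('here', 'RB'), ('today', 'NN')]

print(extract_verbs(tagged))  # ['am', 'be']
```

Keeping the filter separate from the tagging call means the same function should work on output from any tagger that emits Penn Treebank tags.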
Parsing natural language with regex is impossible. Forget it.
As a drastic example: How would you find the verbs (marked with asterisks) in this sentence?
Buffalo buffalo Buffalo buffalo buffalo* buffalo* Buffalo buffalo
While you'll hardly come across extreme cases like this, there are dozens of words that could be verbs, nouns, adjectives, etc., if you look at the word alone.
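To make the problem concrete, here is a small sketch of what a naive suffix-based regex does. The pattern and the sample sentence are my own invention, not something from the question, and a more elaborate pattern would fail in similar ways.

```python
import re

# Hypothetical naive pattern: any word ending in "ing" or "ed"
naive_verb_pattern = re.compile(r"\b\w+(?:ing|ed)\b")

sentence = "The king is wearing a red ring"
matches = naive_verb_pattern.findall(sentence)

print(matches)  # ['king', 'wearing', 'red', 'ring']
```

Only "wearing" is actually a verb here; "king", "red" and "ring" are false positives, and the verb "is" is missed entirely because it follows no suffix pattern.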
You need a natural language parser like Stanford NLP. I have never used one, so I don't know how good your results are going to be, but better than with Regex, I can tell you that.