I am trying to get the base English word for an English word that has been modified from its base form. This question has been asked here before, but I didn't see a proper answer, so I am asking it this way. I tried two stemmers and one lemmatizer from the NLTK package: the Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer.
I tried this code:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

words = ['arrival', 'conclusion', 'ate']
for word in words:
    print("\n\nOriginal Word =>", word)
    print("porter stemmer=>", PorterStemmer().stem(word))
    snowball_stemmer = SnowballStemmer("english")
    print("snowball stemmer=>", snowball_stemmer.stem(word))
    # With no pos argument, lemmatize() treats the word as a noun.
    print("WordNet Lemmatizer=>", WordNetLemmatizer().lemmatize(word))
This is the output I get:
Original Word => arrival
porter stemmer=> arriv
snowball stemmer=> arriv
WordNet Lemmatizer=> arrival
Original Word => conclusion
porter stemmer=> conclus
snowball stemmer=> conclus
WordNet Lemmatizer=> conclusion
Original Word => ate
porter stemmer=> ate
snowball stemmer=> ate
WordNet Lemmatizer=> ate
But I want this output:
Input : arrival
Output: arrive
Input : conclusion
Output: conclude
Input : ate
Output: eat
How can I achieve this? Are there any tools already available for this? I am aware that this is called morphological analysis, but there must be some tools that already do it. Help is appreciated :)
First Edit
I tried this code
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

query = "The Indian economy is the worlds tenth largest by nominal GDP and third largest by purchasing power parity"

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    # Map a Penn Treebank tag to a WordNet POS constant.
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return wn.NOUN  # fall back to noun for all other tags

tags = nltk.pos_tag(word_tokenize(query))
for tag in tags:
    wn_tag = penn_to_wn(tag[1])
    print(tag[0] + "---> " + WordNetLemmatizer().lemmatize(tag[0], wn_tag))
Here, I tried to use the WordNet lemmatizer with the proper part-of-speech tags. Here is the output:
The---> The
Indian---> Indian
economy---> economy
is---> be
the---> the
worlds---> world
tenth---> tenth
largest---> large
by---> by
nominal---> nominal
GDP---> GDP
and---> and
third---> third
largest---> large
by---> by
purchasing---> purchase
power---> power
parity---> parity
Still, words like "arrival" and "conclusion" won't get processed with this approach. Is there any solution for this?
In English grammar, a base is the form of a word to which prefixes and suffixes can be added to create new words. For example, instruct is the base for forming instruction, instructor, and reinstruct. A base is also called a root or stem. Put another way, base forms are words that are not derived from or made up of other words.
Basic form (plural: basic forms): the uninflected form of a word, used as a dictionary entry.
Roots, or base words, are morphemes that form the base of a word and usually carry its meaning.
A base word is a word that can have a prefix or a suffix added to it. When a prefix or suffix is added to a base word, the word's meaning changes and a new word is formed. A prefix is added to the beginning of a base word.
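To make the difference between a stem and a base word concrete, here is a minimal sketch of blind suffix stripping. It is a toy, not the Porter or Snowball algorithm, and the suffix list and the name toy_stem are hypothetical choices made only for illustration. It shows why a stemmer can return a fragment such as "arriv" instead of the dictionary word "arrive": it removes an ending and never consults a dictionary.

```python
# Toy suffix stripper: NOT the Porter or Snowball algorithm, illustration only.
SUFFIXES = ("ion", "al", "ing", "ed", "s")  # hypothetical, deliberately tiny list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no suffix matched, so the word is returned unchanged


for w in ("arrival", "conclusion", "ate"):
    print(w, "->", toy_stem(w))
```

The stripped results are stems, not words, and an irregular form like "ate" is untouched because no suffix rule applies to it. Recovering "arrive" or "eat" needs dictionary knowledge rather than string surgery.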
Ok, so... for the word "ate" I think you're looking for NodeBox::Linguistics.
import en
print(en.verb.present("gave"))
>>> give
I also did not completely understand why you want the verb for "arrival" but not the one for "conclusion".
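For derived nouns such as "arrival" and "conclusion", one option that might help is WordNet's derivationally related forms, which NLTK exposes on Lemma objects. The following is only a sketch under assumptions: it needs nltk with the WordNet corpus installed (nltk.download("wordnet")), the helper names common_prefix_len, choose_base, related_verbs, and noun_to_verb are hypothetical, and picking the candidate with the longest shared prefix is my own heuristic, not something WordNet provides.

```python
def common_prefix_len(a, b):
    """Count how many leading characters two strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def choose_base(word, candidates):
    """Return the candidate sharing the longest prefix with word, or None."""
    ordered = sorted(candidates)  # sort first so ties resolve deterministically
    if not ordered:
        return None
    return max(ordered, key=lambda c: common_prefix_len(word, c))


def related_verbs(word):
    """Collect verb lemmas that WordNet marks as derivationally related to a noun."""
    # Imported lazily: needs nltk plus the WordNet corpus.
    from nltk.corpus import wordnet as wn

    verbs = set()
    for lemma in wn.lemmas(word, pos=wn.NOUN):
        for related in lemma.derivationally_related_forms():
            if related.synset().pos() == wn.VERB:
                verbs.add(related.name())
    return verbs


def noun_to_verb(word):
    """Hypothetical helper: best-guess verb for a noun, or None if WordNet has none."""
    return choose_base(word, related_verbs(word))
```

Calling noun_to_verb("arrival") is intended to map the noun back to a verb like "arrive", but the result depends on what your WordNet version records, so check it on your own words. Irregular inflections such as "ate" are a different case: there, passing pos='v' to WordNetLemmatizer().lemmatize() is usually the simpler route.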