So, I was wondering if anyone had any idea how to combine multiple terms to create a single term in the taggers in NLTK..
For example, when I do:
nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))
It gives me:
[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
How do I make it put 'Apple' and 'Incorporated' Together to be ('Apple Incorporated','NNP')
You could try taking a look at nltk.RegexParser. It allows you to chunk part of speech tagged content based on regular expressions. In your example, you could do something like
pattern = "NP:{<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(p)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print t
This would give you:
Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With