I am currently trying to implement a tagging engine in Java and have been searching for ways to extract keywords/tags from texts (articles). I found some answers on Stack Overflow suggesting Pointwise Mutual Information (PMI).
Solution 1
Solution 2
I can't use Python and NLTK, so I have to implement it myself, but I don't know how to calculate the probabilities. The equation looks like this:
PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]
What I want to know is how to calculate P(term, doc).
I already have a large text corpus and a collection of articles. The articles are not part of the corpus. The corpus is indexed with Lucene.
Please help me out. Best regards.
There are a lot of algorithms and tools for doing this:
Open source tools:
KEA (http://www.nzdl.org/Kea/): a supervised approach that uses training data and a controlled vocabulary.
Maui indexer (http://code.google.com/p/maui-indexer/): basically an extension of KEA that adds the ability to use an encyclopedia for key phrase extraction.
Carrot2 (http://project.carrot2.org/): an unsupervised approach to key phrase extraction. It supports many variations of input format, output format and parameters.
MALLET topic modeling module (http://mallet.cs.umass.edu/topics.php)
Stanford Topic Modeling Toolbox (http://nlp.stanford.edu/software/tmt/tmt-0.3/)
Mahout clustering algorithms (http://mahout.apache.org/)
Commercial APIs:
Alchemy API (http://www.alchemyapi.com/api/keyword-extraction/)
Zemanta API (http://www.zemanta.com/developer/)
Yahoo term extraction API (http://developer.yahoo.com/contentanalysis/)
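If you still want to compute the PMI from the question yourself, here is a minimal sketch of one possible way to estimate the probabilities from plain token counts. It is an assumption on my part, not the only valid definition: the article is treated as if it were appended to the corpus, P(term, doc) is the share of all tokens that are this term inside this article, P(term) is the share of all tokens that are this term anywhere, and P(doc) is the article's share of all tokens. The class name PmiSketch and both method names are hypothetical, and the whitespace tokenizer is only a placeholder for whatever analyzer you used when indexing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; all names here are illustrative only.
class PmiSketch {

    // Counts how often each lower-cased, whitespace-separated token occurs.
    // A real setup would reuse the same analysis that built the index.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ]
    // Assumption: the article is counted as part of the collection, so
    // the term's collection count is never zero when it occurs in the article.
    static double pmi(long termInDoc, long docLength,
                      long termInCorpus, long corpusLength) {
        if (termInDoc <= 0 || docLength <= 0) {
            throw new IllegalArgumentException("term must occur in a non-empty document");
        }
        double n = (double) corpusLength + docLength;        // all tokens in corpus + article
        double pJoint = termInDoc / n;                        // P(term, doc)
        double pTerm = (termInCorpus + termInDoc) / n;        // P(term)
        double pDoc = docLength / n;                          // P(doc)
        return Math.log(pJoint / (pTerm * pDoc));
    }
}
```

You would call countTerms on the article text to get termInDoc and docLength (the sum of all counts), and take termInCorpus and corpusLength from the statistics of your index. With these estimates the score reduces to the log of the term's relative frequency in the article divided by its relative frequency in the whole collection, so terms that are unusually frequent in the article score above zero. Rare terms tend to get inflated scores, so a minimum count threshold is probably worth adding.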