
How to extract keywords (tags) from text

I am currently trying to implement a tagging engine in Java and have been looking for ways to extract keywords (tags) from texts (articles). I found some answers on Stack Overflow that suggest using Pointwise Mutual Information (PMI).

Solution 1

Solution 2

I can't use Python and NLTK, so I have to implement it myself, but I don't know how to calculate the probabilities. The equation looks like this:

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

What I want to know is how to calculate P(term, doc).

I already have a large text corpus and a collection of articles. The articles are not part of the corpus. The corpus is indexed with Lucene.

Please help me out. Best regards.

BauerMitFackel asked Jan 15 '13 13:01


1 Answer

There are a lot of algorithms and tools for doing this.

Open source tools:

KEA (http://www.nzdl.org/Kea/): a supervised approach that uses training data and a controlled vocabulary.

Maui indexer (http://code.google.com/p/maui-indexer/): essentially an extension of KEA that can also use an encyclopedia for key phrase extraction.

Carrot2 (http://project.carrot2.org/): an unsupervised approach to key phrase extraction. It supports many variations of input format, output format and extraction parameters.

MALLET topic modeling module (http://mallet.cs.umass.edu/topics.php)

Stanford Topic Modeling Toolbox (http://nlp.stanford.edu/software/tmt/tmt-0.3/)

Mahout clustering algorithms (http://mahout.apache.org/)

Commercial APIs:

AlchemyAPI (http://www.alchemyapi.com/api/keyword-extraction/)

Zemanta API (http://www.zemanta.com/developer/)

Yahoo term extraction API (http://developer.yahoo.com/contentanalysis/)
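If none of these tools fit and the PMI formula from the question has to be implemented by hand, here is a minimal sketch in Java. It rests on one way of reading the formula, not the only one. Since P(term, doc) / P(doc) = P(term | doc), the score can be rewritten as log [ P(term | doc) / P(term) ], so P(term, doc) never has to be estimated on its own. The sketch assumes P(term | doc) is estimated as the term's count in the article divided by the article length, and P(term) as the term's count in the background corpus divided by the corpus length. Because the articles are not part of the corpus, a term may be missing from it, so the corpus estimate uses add-one smoothing (also an assumption). The class name `PmiSketch` and its methods are hypothetical, and the corpus counts are passed in as plain numbers; reading them from the Lucene index is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
public class PmiSketch {

    /**
     * PMI(term, doc) = log( P(term | doc) / P(term) ), estimated from raw counts.
     * Assumption: add-one smoothing on the corpus side so that terms which never
     * occur in the background corpus do not cause a division by zero.
     */
    public static double pmi(long termCountInDoc, long docLength,
                             long termCountInCorpus, long corpusLength) {
        if (termCountInDoc <= 0 || docLength <= 0) {
            return Double.NEGATIVE_INFINITY; // term does not occur in the document
        }
        double pTermGivenDoc = (double) termCountInDoc / docLength;
        double pTerm = (termCountInCorpus + 1.0) / (corpusLength + 1.0);
        return Math.log(pTermGivenDoc / pTerm);
    }

    /** Returns the k document terms with the highest PMI score, best first. */
    public static List<String> topTerms(List<String> docTokens,
                                        Map<String, Long> corpusCounts,
                                        long corpusLength, int k) {
        // Term frequencies inside the document.
        Map<String, Long> tf = new HashMap<>();
        for (String token : docTokens) {
            tf.merge(token, 1L, Long::sum);
        }

        // Score every distinct document term against the background corpus.
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Long> e : tf.entrySet()) {
            long corpusCount = corpusCounts.getOrDefault(e.getKey(), 0L);
            scores.put(e.getKey(),
                       pmi(e.getValue(), docTokens.size(), corpusCount, corpusLength));
        }

        // Sort terms by descending score and keep the first k.
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> scores.get(t)).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

Calling `PmiSketch.topTerms(tokens, corpusCounts, corpusLength, 10)` with the tokenized article returns up to ten tag candidates. In practice very rare terms tend to get inflated PMI scores, so a minimum document or corpus count is usually applied as well.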

Paresh Behede answered Sep 28 '22 04:09