I'm looking for Java-based tools for extracting relevant tags from a given article. I need a tool that will try to identify the main subjects and terms a given article is related to. Thanks for helping.
Check the following key words/topics extraction software/tools:
If you would like to develop your own topic detection system, you should take a look at the LDA implementation in MALLET (link to a working LDA sample; the one on the MALLET homepage does not work with the newest MALLET version).
You can use HtmlUnit to parse the article's HTML and query for the parts of the document you are interested in searching. Then you can apply a simple algorithm of your own design to determine tags/keywords.
For instance, split() the text on whitespace and then count how many times each word occurs. The words that occur most often (ignoring common words such as "and", "the", "if", etc.) are good candidates for keywords.
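The counting approach described above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name `KeywordCounter`, the method `topKeywords`, and the short stop-word list are all assumptions made for the example, and a real stop-word list would be much longer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class KeywordCounter {
    // Hypothetical, deliberately tiny stop-word list; extend it for real use.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "if", "of", "to", "in", "is", "it");

    // Returns up to `limit` words ordered by descending frequency.
    public static List<String> topKeywords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on whitespace, as suggested in the answer.
        for (String token : text.toLowerCase().split("\\s+")) {
            // Strip punctuation so "tags." and "tags" count as the same word.
            String word = token.replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                // Most frequent first; ties broken alphabetically for stable output.
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String article = "Java tools extract tags. Java tools help tag articles.";
        System.out.println(topKeywords(article, 2));
    }
}
```

Note that this treats "tag" and "tags" as different words; stemming or lemmatization would be a separate step if that matters for your articles.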