I want to build a list of ~6 keywords (or, even better, keyphrases of a couple of words) for each message in a message forum.
Anyone know a good C# library for accomplishing this? Maybe there's a way to bend Lucene.NET into providing this sort of info?
Or, failing that, can anyone suggest an algorithm (or set of algos) to read up on? If I implement it myself, I need something not terribly complex: I can only tackle this if it's tractable in about a week. Right now, the best simple-but-effective approach I've found is TF-IDF.
UPDATE: I've uploaded the results of using TF-IDF to select the top 5 keywords from a real dataset here: http://jsbin.com/oxanoc/2/edit#preview
The results are mediocre, but not totally useless... maybe with the addition of detecting multi-word phrases, this would be good enough.
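For reference, the TF-IDF baseline mentioned above can be sketched in a few lines. This is a minimal illustration in Python rather than C#, and it is not the code behind the linked results: the `tokenize` and `top_keywords` names, the regex tokenizer, and the plain `log(N / df)` weighting are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(messages, k=5):
    """Return the k highest-scoring TF-IDF terms for each message."""
    tokenized = [tokenize(m) for m in messages]
    n_docs = len(tokenized)

    # Document frequency: the number of messages that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for stability.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results
```

Terms that occur in every message get a score of zero, so very common words drop out without a stop-word list. A real forum corpus would probably still benefit from stemming and a stop-word filter.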
I implemented a keyword extraction algorithm in Java a few weeks ago for a university project, using the tf-idf model.
Algorithm:
First, we looked for all bigrams in the paragraph, and extracted the meaningful ones. (*)
Next, we took the set of unigrams and bigrams, and evaluated each with its respective tf-idf score. For the idf score of each term we used the "documents count" retrieved from the Bing API.
(*) Deciding which bi-gram is meaningful:
We tried various heuristics to decide which bi-grams can be considered meaningful. In the end, the best results were achieved by "asking" Wikipedia: we searched for the bi-gram, and if there was an article containing it, we considered it meaningful.
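The two steps above could look roughly like the following. This is a Python sketch, not the original Java code, and it makes no network calls: `known_phrases` is a hypothetical stand-in for the Wikipedia lookup, while `doc_counts` and `total_docs` are hypothetical stand-ins for the document counts obtained from the search API. The `log(total_docs / df)` form of the idf is also an assumption.

```python
import math
import re
from collections import Counter


def candidate_terms(text, known_phrases):
    """Return all unigrams plus the bigrams found in known_phrases."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # Keep only bigrams judged meaningful (assumed: membership in a set).
    meaningful = [b for b in bigrams if b in known_phrases]
    return tokens + meaningful


def score_terms(text, known_phrases, doc_counts, total_docs):
    """Score each candidate by tf * idf using externally supplied counts."""
    tf = Counter(candidate_terms(text, known_phrases))
    scores = {}
    for term, count in tf.items():
        # Terms with no count are treated as appearing in one document
        # (an assumption, so the logarithm stays defined).
        df = max(1, doc_counts.get(term, 1))
        scores[term] = count * math.log(total_docs / df)
    return scores
```

Sorting the returned dictionary by score and taking the first few entries gives the keyword list. Because rare bigrams have small document counts, they tend to outrank the common unigrams they are built from.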
Evaluation:
We evaluated the algorithm on a set of 50 abstracts from random articles and measured its precision and recall.
The result was ~40% recall and ~35% precision, which is not too bad.
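For anyone repeating this kind of evaluation, precision and recall for one document can be computed as below. This is a minimal sketch that assumes each abstract has a hand-labelled set of reference keywords and that matching is exact string equality. The function name and both of those choices are assumptions, not details of the original project.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for extracted vs. reference keywords."""
    extracted_set = set(extracted)
    reference_set = set(reference)
    hits = len(extracted_set & reference_set)
    # Precision: the fraction of extracted keywords that are correct.
    precision = hits / len(extracted_set) if extracted_set else 0.0
    # Recall: the fraction of reference keywords that were found.
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall
```

Averaging the two values over all documents gives corpus-level figures comparable to the ones quoted above.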