
Algorithm (or C# library) for identifying 'keywords' in a set of messages? [closed]

I want to build a list of ~6 keywords (or even better: keyphrases of a couple of words) for each message in a message forum.

  • The primary use of keywords is to replace subject lines in some instances. For example: Message from Terry sent Dec 5, keywords: norwegian blue, plumage, not dead
  • In a super ideal world, keywords would identify both unique phrases and phrases that cluster the discussion into "topics", i.e. words that are highly relevant to the message in question and to a few other messages in the forum, but not found frequently in the forum as a whole.
  • I expect junk phrases to show up, no big deal.
  • Can't be too computationally expensive: I need something that can handle several hundred messages in several seconds, as I'll need to re-run this every time a new message comes in.

Anyone know a good C# library for accomplishing this? Maybe there's a way to bend Lucene.NET into providing this sort of info?

Or, failing that, can anyone suggest an algorithm (or set of algos) to read up on? If I'm implementing it myself I need something not terribly complex; I can only tackle this if it's tractable in about a week. Right now, the best I've found in terms of simple-but-effective is TF-IDF.
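For reference, here is a minimal sketch of plain TF-IDF keyword selection over a set of messages. It is written in Python rather than C#, and the tokenizer, the function names `tokenize` and `top_keywords`, and the unsmoothed `log(N / df)` weighting are all assumptions made for illustration, not any particular library's behaviour.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(messages, k=6):
    """Return the k highest-scoring tf-idf terms for each message."""
    docs = [tokenize(m) for m in messages]
    n_docs = len(docs)
    # Document frequency: the number of messages each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


messages = [
    "the parrot is dead",
    "the parrot has lovely plumage",
    "the shop is closed",
]
print(top_keywords(messages, k=2))
```

A term that occurs in every message (here "the") gets a score of zero, which is the "not found frequently in the forum as a whole" property asked for above. For a few hundred messages this is a single pass over the text, so recomputing on each new message should be cheap.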

UPDATE: I've uploaded the results of using TF-IDF to select the top 5 keywords from a real dataset here: http://jsbin.com/oxanoc/2/edit#preview

The results are mediocre, but not totally useless... maybe with the addition of detecting multi-word phrases, this would be good enough.

Seth asked Jan 01 '12 21:01




1 Answer

I implemented a keyword extraction algorithm in Java a few weeks ago for a university project, using the tf-idf model.

Algorithm:
First, we looked for all bigrams in the paragraph and extracted the meaningful ones. (*)
Next, we took the set of unigrams and bigrams and evaluated each with its respective tf-idf score. The idf score of each term was based on the "documents count" retrieved via the Bing API.
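The scoring step might look roughly like the sketch below (in Python here, not the original Java). The names `candidates` and `score_terms`, the `doc_counts` dictionary and `total_docs` value are hypothetical stand-ins for whatever counts a search API returns, and the add-one smoothing is an assumption rather than the formula we actually used.

```python
import math
from collections import Counter


def candidates(tokens):
    """Unigrams plus adjacent-word bigrams (bigrams joined by a space)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def score_terms(tokens, doc_counts, total_docs):
    """Score each candidate by tf * idf using externally supplied document counts.

    doc_counts and total_docs are hypothetical stand-ins for search-API counts;
    terms missing from doc_counts are treated as appearing in zero documents.
    """
    tf = Counter(candidates(tokens))
    scores = {}
    for term, count in tf.items():
        # Add-one smoothing avoids division by zero for unseen terms (an assumption).
        idf = math.log((1 + total_docs) / (1 + doc_counts.get(term, 0)))
        scores[term] = count * idf
    return scores
```

With made-up counts such as `{"norwegian": 100, "blue": 5000, "norwegian blue": 10}` and `total_docs=10000`, the rare bigram outranks both of its more common component words.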

(*) Deciding which bi-gram is meaningful:
We tried various heuristics to decide which bi-grams can be considered meaningful. In the end, the best results were achieved by "asking" Wikipedia: we searched for the bi-gram, and if there was an article containing it, we considered it meaningful.
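A minimal sketch of that filter, in Python, with the lookup injected as a function so it can be tried offline. `meaningful_bigrams`, `has_article` and `known_titles` are hypothetical names; a real `has_article` would wrap a Wikipedia search, which is not shown here.

```python
def meaningful_bigrams(tokens, has_article):
    """Keep adjacent-word bigrams for which has_article(bigram) is true.

    has_article is a hypothetical callable (str -> bool); any lookup works,
    for example one backed by a Wikipedia search.
    """
    seen = set()
    kept = []
    for first, second in zip(tokens, tokens[1:]):
        bigram = f"{first} {second}"
        if bigram in seen:
            continue  # query each distinct bigram only once
        seen.add(bigram)
        if has_article(bigram):
            kept.append(bigram)
    return kept


# Offline stand-in for the lookup: a small set of known article titles.
known_titles = {"norwegian blue", "dead parrot"}
print(meaningful_bigrams(
    "the norwegian blue is a dead parrot".split(),
    lambda bigram: bigram in known_titles,
))
# prints ['norwegian blue', 'dead parrot']
```

Since each lookup is a network call in practice, caching results across messages would probably matter more for speed than anything in the scoring itself.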

Evaluation:
We evaluated the algorithm on a set of 50 abstracts from random articles and measured its precision and recall.
The result was ~40% recall and ~35% precision, which is not too bad.
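For anyone wanting to run a similar evaluation, here is a minimal Python sketch of the two measures for a single document. It assumes exact-match comparison between the extracted keywords and a hand-labelled reference set, which is a simplification; the function name `precision_recall` is just illustrative.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted keywords against a reference set."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    # Precision: share of extracted keywords that are correct.
    precision = hits / len(extracted) if extracted else 0.0
    # Recall: share of reference keywords that were found.
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

Per-document values can then be averaged over the whole evaluation set.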

amit answered Oct 13 '22 21:10
