Algorithm (or C# library) for identifying 'keywords' in a set of messages? [closed]

Tags:

c#

algorithm

search

nlp

text-mining

I want to build a list of ~6 keywords (or even better: couple word keyphrases) for each message in a message forum.

The primary use of keywords is to replace subject lines in some instances. For example: Message from Terry sent Dec 5, keywords: norwegian blue, plumage, not dead
In a super ideal world keywords would identify both unique phrases and phrases that cluster the discussion into "topics", i.e. words that are highly relevant to the message in question and to a few other messages in the forum, but not found frequently in the forum as a whole.
I expect junk phrases to show up, no big deal.
Can't be too computationally expensive: I need something that can handle several hundred messages in several seconds, as I'll need to re-run this every time a new message comes in.

Anyone know a good C# library for accomplishing this? Maybe there's a way to bend Lucene.NET into providing this sort of info?

Or, failing that, can anyone suggest an algorithm (or set of algos) to read up on? If I'm implementing it myself I need something not terribly complex; I can only tackle this if it's tractable in about a week. Right now, the best I've found in terms of simple-but-effective is TF-IDF.

UPDATE: I've uploaded the results of using TF-IDF to select the top 5 keywords from a real dataset here: http://jsbin.com/oxanoc/2/edit#preview

The results are mediocre, but not totally useless... maybe with the addition of detecting multi-word phrases, this would be good enough.
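
For reference, the kind of TF-IDF selection described here can be sketched roughly as follows. This is a minimal illustration, not the code behind the linked results: it assumes the idf is computed from the forum's own messages, uses a naive tokenizer with no stop-word list, and the class and method names (`TfIdfKeywords`, `topKeywords`) are made up for the example.

```java
import java.util.*;

class TfIdfKeywords {
    // Lowercase the text and split on anything that is not a letter, digit or apostrophe.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns the k highest-scoring terms of messages.get(index).
    static List<String> topKeywords(List<String> messages, int index, int k) {
        // Document frequency: number of messages that contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (String m : messages) {
            for (String t : new HashSet<>(tokenize(m))) df.merge(t, 1, Integer::sum);
        }
        // Term frequency inside the target message.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(messages.get(index))) tf.merge(t, 1, Integer::sum);

        // score = tf * log(N / df); a term present in every message scores 0.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) messages.size() / df.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; ties broken alphabetically so the output is stable.
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return terms.subList(0, Math.min(k, terms.size()));
    }
}
```

Recomputing the document frequencies on every call is the simplest option; for a few hundred short messages that should be cheap, though caching `df` and updating it when a message arrives would avoid the repeated pass.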

704

asked Jan 01 '12 21:01

Seth

1 Answer

I implemented a keyword extraction algorithm in Java a few weeks ago for a university project, using the tf-idf model.

Algorithm:
First, we looked for all bigrams in the paragraph and kept the meaningful ones. (*)
Next, we took the set of unigrams and bigrams and scored each with its respective tf-idf score. The idf score of each term was based on the "document count" returned by the Bing API.
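
The two steps above can be sketched roughly like this. It is an illustration under assumptions rather than our project code: the document-count lookup is passed in as a plain function (a stand-in for the search API call), `totalDocs` is an assumed size of the indexed collection, the idf formula `log(totalDocs / (1 + count))` is one common choice, and the meaningfulness filter for bigrams is left out.

```java
import java.util.*;
import java.util.function.ToLongFunction;

class NgramScorer {
    // Adjacent word pairs, e.g. [a, b, c] -> ["a b", "b c"].
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Scores every unigram and bigram of one paragraph with tf * idf.
    // docCount is a hypothetical lookup: "how many documents contain this term?"
    static Map<String, Double> score(List<String> tokens,
                                     ToLongFunction<String> docCount,
                                     long totalDocs) {
        List<String> candidates = new ArrayList<>(tokens);
        candidates.addAll(bigrams(tokens));

        // Term frequency of each candidate within the paragraph.
        Map<String, Integer> tf = new HashMap<>();
        for (String c : candidates) tf.merge(c, 1, Integer::sum);

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long n = docCount.applyAsLong(e.getKey());
            // The +1 avoids division by zero for terms the lookup has never seen.
            double idf = Math.log((double) totalDocs / (1 + n));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }
}
```

Passing the lookup in as a function keeps the scoring testable offline, and lets the counts come from a local index instead of a web API if speed matters.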

(*) Deciding which bigrams are meaningful:
We tried various heuristics to decide which bigrams can be considered meaningful. In the end, the best results were achieved by "asking" Wikipedia: we searched for the bigram, and if there was an article containing it, we considered it meaningful.
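
A rough sketch of that filter, with the Wikipedia search abstracted away: `hasArticle` is a hypothetical predicate standing in for whatever lookup is used (a web search, a local dump of article titles, and so on), and the cache is only there so that a repeated bigram does not trigger a second lookup.

```java
import java.util.*;
import java.util.function.Predicate;

class BigramFilter {
    // Keeps the bigrams for which the lookup reports a matching article.
    // Order and duplicates are preserved so term frequencies can be counted afterwards.
    static List<String> meaningful(List<String> bigrams, Predicate<String> hasArticle) {
        Map<String, Boolean> cache = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String b : bigrams) {
            // Each distinct bigram is looked up at most once.
            if (cache.computeIfAbsent(b, hasArticle::test)) kept.add(b);
        }
        return kept;
    }
}
```

With a remote lookup the number of calls dominates the running time, so persisting the cache between runs is worth considering.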

Evaluation:
We evaluated the algorithm on a set of 50 abstracts from random articles and measured its precision and recall.
The result was ~40% recall and ~35% precision, which is not too bad.
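
For clarity, precision and recall for one abstract can be computed along these lines. The sketch assumes each abstract comes with a reference ("gold") keyword set and that a match means exact string equality; both are assumptions of the example, and the names are illustrative.

```java
import java.util.*;

class KeywordEval {
    // Returns {precision, recall} for one document.
    // precision = correct extracted / all extracted
    // recall    = correct extracted / all reference keywords
    static double[] precisionRecall(Set<String> extracted, Set<String> gold) {
        Set<String> hits = new HashSet<>(extracted);
        hits.retainAll(gold); // keywords that appear in both sets
        double precision = extracted.isEmpty() ? 0.0 : (double) hits.size() / extracted.size();
        double recall = gold.isEmpty() ? 0.0 : (double) hits.size() / gold.size();
        return new double[] {precision, recall};
    }
}
```

For example, extracting 4 keywords of which 2 are among 5 reference keywords gives a precision of 0.5 and a recall of 0.4.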

125

answered Oct 13 '22 21:10

amit
