
Preserving only domain-specific keywords?

I am trying to determine the most popular keywords for a certain class of documents in my collection. Assuming that the domain is "computer science" (which, of course, includes networking, computer architecture, etc.), what is the best way to preserve these domain-specific keywords from the text? I tried using WordNet, but I am not quite sure how best to use it to extract this information.

Is there a well-known list of words that I can use as a whitelist, given that I am not aware of all the domain-specific keywords beforehand? Or are there any good NLP/machine learning techniques to identify domain-specific keywords?

Legend asked Jan 19 '23 09:01



1 Answer

You need a huge training set of documents. A small subset of this collection (still a large number of documents) should represent the given domain. Using nltk, compute word statistics, taking morphology into account (for example, by stemming or lemmatizing) and filtering out stopwords. A good statistic is TF*IDF, which is roughly the number of occurrences of a word in the domain subset divided by the number of documents in the whole collection that contain the word. The keywords are the words with the highest TF*IDF.
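As a rough illustration of the scoring described above, here is a minimal sketch using only the Python standard library. The function names (`tokenize`, `domain_keywords`) and the tiny stopword set are hypothetical and exist only for this example; in practice nltk's stopword corpus and a stemmer or lemmatizer would take their place. The sketch also assumes the common log-scaled form of IDF rather than the plain ratio, and it skips morphology entirely.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch;
# a fuller list such as nltk's stopword corpus would be used in practice).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "that", "it"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def domain_keywords(domain_docs, all_docs, top_n=5):
    """Rank words by (count in the domain subset) * log(N / document frequency).

    domain_docs is assumed to be a subset of all_docs, so every domain
    word has a document frequency of at least 1.
    """
    # Term frequency: occurrences of each word across the domain subset.
    tf = Counter(w for doc in domain_docs for w in tokenize(doc))

    # Document frequency: number of documents in the whole collection
    # that contain each word at least once.
    df = Counter()
    for doc in all_docs:
        df.update(set(tokenize(doc)))

    n_docs = len(all_docs)
    scores = {w: count * math.log(n_docs / df[w]) for w, count in tf.items() if df[w]}

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    domain = ["the router forwards packets to the router", "packets cross the network"]
    other = ["the garden is in bloom", "the network of friends met in the garden"]
    print(domain_keywords(domain, domain + other))
```

Words that are frequent in the domain subset but rare elsewhere ("router" here) rise to the top, while words that also appear in unrelated documents ("network") are pushed down. With a real collection the same idea applies, only with far more documents and proper morphological normalization.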

Andrey Sboev answered Jan 20 '23 23:01
