I am trying to determine the most popular keywords for a certain class of documents in my collection. Assuming the domain is "computer science" (which, of course, includes networking, computer architecture, etc.), what is the best way to extract these domain-specific keywords from text? I tried using WordNet, but I am not quite sure how best to use it to extract this information.
Are there any well-known lists of words that I could use as a whitelist, given that I am not aware of all the domain-specific keywords beforehand? Or are there any good NLP/machine learning techniques to identify domain-specific keywords?
You need a large training collection of documents, with a subset (still a large set of documents) representing the given domain. Using nltk, compute word statistics, normalizing for morphology (stemming or lemmatization) and filtering out stopwords. A good statistic is TF*IDF, which is roughly the number of occurrences of a word in the domain subset multiplied by the inverse of the fraction of documents in the whole collection that contain the word. The keywords are the words with the greatest TF*IDF.
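To make the idea concrete, here is a minimal sketch with a toy corpus, a hand-rolled stopword set, and whitespace tokenization; the document sets, stopword list, and tokenizer are placeholders (in practice you would use your real collection plus nltk's tokenizer, stopword corpus, and a stemmer or lemmatizer):

```python
import math
from collections import Counter

# Toy data: a "computer science" subset and the whole collection (assumptions).
domain_docs = [
    "the network router forwards packets to the switch",
    "tcp congestion control limits packet loss in the network",
    "cache coherence protocols in multicore architecture",
]
full_collection = domain_docs + [
    "the chef seasons the soup with salt and pepper",
    "the garden needs water in the summer heat",
]

STOPWORDS = {"the", "a", "an", "in", "to", "and", "with", "of"}

def tokenize(text):
    # Placeholder: swap in nltk.word_tokenize plus stemming/lemmatization.
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Term frequency: occurrences of each word across the domain subset.
tf = Counter(w for doc in domain_docs for w in tokenize(doc))

# Document frequency: how many documents in the WHOLE collection contain the word.
df = Counter()
for doc in full_collection:
    df.update(set(tokenize(doc)))

# TF*IDF per domain word; high scores mark domain-specific keywords.
n_docs = len(full_collection)
tfidf = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

keywords = sorted(tfidf, key=tfidf.get, reverse=True)[:5]
print(keywords)
```

Words like "network" score highly because they recur in the domain subset but appear in few documents overall, while words spread across the whole collection are damped by the IDF factor.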