I am trying to determine the most popular keywords for a certain class of documents in my collection. Assuming the domain is "computer science" (which, of course, includes networking, computer architecture, etc.), what is the best way to extract these domain-specific keywords from text? I tried using WordNet, but I am not quite sure how best to use it to extract this information.
Are there any well-known lists of words that I could use as a whitelist, given that I am not aware of all the domain-specific keywords beforehand? Or are there any good NLP/machine learning techniques to identify domain-specific keywords?
You need a large training collection of documents, with a subset (still a large set of documents) representing the given domain. Using nltk, compute word statistics, normalizing for morphology (stemming or lemmatization) and filtering out stopwords. A good statistic is TF*IDF, which is roughly the number of occurrences of a word in the domain subset multiplied by the inverse of the fraction of documents in the whole collection that contain the word. The keywords are the words with the greatest TF*IDF.
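To make the idea concrete, here is a minimal sketch with a toy corpus, a hand-rolled stopword set, and whitespace tokenization; the document sets, stopword list, and tokenizer are placeholders (in practice you would use your real collection plus nltk's tokenizer, stopword corpus, and a stemmer or lemmatizer):

```python
import math
from collections import Counter

# Toy data: a "computer science" subset and the whole collection (assumptions).
domain_docs = [
    "the network router forwards packets to the switch",
    "tcp congestion control limits packet loss in the network",
    "cache coherence protocols in multicore architecture",
]
full_collection = domain_docs + [
    "the chef seasons the soup with salt and pepper",
    "the garden needs water in the summer heat",
]

STOPWORDS = {"the", "a", "an", "in", "to", "and", "with", "of"}

def tokenize(text):
    # Placeholder: swap in nltk.word_tokenize plus stemming/lemmatization.
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Term frequency: occurrences of each word across the domain subset.
tf = Counter(w for doc in domain_docs for w in tokenize(doc))

# Document frequency: how many documents in the WHOLE collection contain the word.
df = Counter()
for doc in full_collection:
    df.update(set(tokenize(doc)))

# TF*IDF per domain word; high scores mark domain-specific keywords.
n_docs = len(full_collection)
tfidf = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

keywords = sorted(tfidf, key=tfidf.get, reverse=True)[:5]
print(keywords)
```

Words like "network" score highly because they recur in the domain subset but appear in few documents overall, while words spread across the whole collection are damped by the IDF factor.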