
Tag generation from text content

I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.

Additionally, I would be grateful if you could point me to any Python-based solution/library for this.

Thanks

asked Apr 18 '10 by Hellnar

2 Answers

One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.

To identify keywords like this, you could use the pointwise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document, as opposed to coming across it in the larger collection.

To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
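For illustration, here is a minimal sketch of that selection step. It is not a standard library API: the function name, argument names, and the simple maximum-likelihood probability estimates from raw token counts are all my own assumptions. It scores every term in a document by PMI against a background collection and keeps the 5 highest-scoring terms.

import math
from collections import Counter

def top_keywords_by_pmi(doc_tokens, corpus_docs, n=5):
    # corpus_docs: a list of token lists; doc_tokens should be one of them
    corpus_counts = Counter(token for doc in corpus_docs for token in doc)
    corpus_total = sum(corpus_counts.values())

    doc_counts = Counter(doc_tokens)
    p_doc = len(doc_tokens) / corpus_total              # P(doc): the document's share of corpus tokens

    scores = {}
    for term, count in doc_counts.items():
        p_term = corpus_counts[term] / corpus_total     # P(term) over the whole collection
        p_term_doc = count / corpus_total               # P(term, doc): term occurrences inside this document
        scores[term] = math.log(p_term_doc / (p_term * p_doc))

    # sort terms by their PMI score with the document; keep the n highest
    return sorted(scores, key=scores.get, reverse=True)[:n]

In practice you would want smoothing and a frequency cutoff so that one-off rare tokens do not dominate the ranking.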

If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.

Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
answered Sep 22 '22 by dmcer

First, the key Python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.

Below is a simple template, in Python code, for the problem raised in your question. Although it's a template, it runs: supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.

Keywords for a given document are (obviously) chosen from among the important words in a document, i.e., those words that are likely to distinguish it from another document. If you have no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, e.g., importance = 1/frequency.

text = """ The intensity of the feeling makes up for the disproportion of the objects.  Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love.  When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """  BAD_CHARS = ".!?,\'\""  # transform text into a list words--removing punctuation and filtering small words words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]  word_freq = {}  # generate a 'word histogram' for the text--ie, a list of the frequencies of each word for word in words :   word_freq[word] = word_freq.get(word, 0) + 1  # sort the word list by frequency  # (just a DSU sort, there's a python built-in for this, but i can't remember it) tx = [ (v, k) for (k, v) in word_freq.items()] tx.sort(reverse=True) word_freq_sorted = [ (k, v) for (v, k) in tx ]  # eg, what are the most common words in that text? print(word_freq_sorted) # returns: [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)] # obviously using a text larger than 50 or so words will give you more meaningful results  term_importance = lambda word : 1.0/word_freq[word]  # select document keywords from the words at/near the top of this list: map(term_importance, word_freq.keys()) 
answered Sep 19 '22 by doug