
What is a simple way to generate keywords from a text?

I suppose I could take a text and remove high-frequency English words from it. By keywords, I mean the words that best characterize the content of the text (tags). It doesn't have to be perfect; a good approximation is enough for my needs.

Has anyone done anything like that? Do you know of a Perl or Python library that does it?

Lingua::EN::Tagger is exactly what I asked for; however, I needed a library that could work for French text too.

Emmanuel Caradec asked Jan 21 '09 15:01



2 Answers

The name for "high-frequency English words" is stop words, and many lists of them are available. I'm not aware of any Python or Perl libraries, but you could store your stop word list in a binary tree or hash (or use Python's frozenset); then, as you read each word from the input text, check whether it is in your stop list and filter it out.

Note that after you remove the stop words, you'll need to do some stemming to normalize the resulting text (remove plurals, -ings, -eds), then remove all the duplicate "keywords".
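
As a rough sketch of the steps above in Python: the stop list here is a tiny hand-written sample and `naive_stem` is a deliberately crude suffix stripper, both illustrative assumptions rather than real resources. Ranking the remaining stems by frequency is also an added assumption; counting them takes care of the duplicates.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); use a full published list in practice.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "were", "with",
})


def naive_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def extract_keywords(text, limit=10):
    """Return up to `limit` distinct stems, most frequent first."""
    # Runs of letters only (accented letters included), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    stems = [naive_stem(w) for w in words if w not in STOP_WORDS]
    # Counter collapses duplicates and orders stems by frequency.
    return [stem for stem, _ in Counter(stems).most_common(limit)]
```

For example, `extract_keywords("The cats are chasing the cat and the dogs chased a cat", limit=2)` returns `['cat', 'chas']`. The word pattern also matches accented letters, so the same approach should carry over to French text with a French stop list and suffix rules swapped in.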

florin answered Sep 20 '22 06:09



You could try the Perl module Lingua::EN::Tagger for a quick and easy solution.

A more complex module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger together with a WordNet database to produce more structured output. Both are pretty easy to use; check the documentation on CPAN, or run perldoc after you install the module.

andymurd answered Sep 19 '22 06:09
