 

How to extract keywords from a block of text in Haskell

Tags: haskell, nlp

I know this is a fairly large topic, but I need to accept a chunk of text and extract the most interesting keywords from it. The text comes from TV captions, so the subject can range from news to sports to pop culture references. I can also provide the type of show the text came from.

My idea is to somehow match the text against a dictionary of terms I know to be interesting.

Which libraries for Haskell can help me with this?

Assuming I do have a dictionary of interesting terms, and a database to store them in, is there a particular approach you'd recommend to matching keywords within the text?

Is there an obvious approach I'm not thinking of?

Sean Clark Hess asked Nov 12 '11 22:11



2 Answers

I'd stem the words in the chunks of text and then search for all the terms in the dictionary. Here are two libraries, picked more or less at random:

Stemming: http://hackage.haskell.org/packages/archive/stemmer/0.2/doc/html/NLP-Stemmer-C.html

Search: http://hackage.haskell.org/packages/archive/sphinx/0.2.1/doc/html/Text-Search-Sphinx.html
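As a rough illustration of the stem-then-look-up idea, here is a minimal sketch that uses only `base` and `containers`. The function `naiveStem` is a hypothetical stand-in that strips a few English suffixes; it is not the API of the stemmer package linked above, and a real Snowball/Porter stemmer should replace it. The names `tokenize`, `mkDictionary`, and `keywordCounts` are also made up for this example.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (isSuffixOf)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | Split text into lowercase alphabetic tokens.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- | Hypothetical stand-in for a real stemmer: strips the first matching
-- suffix, as long as at least three characters remain.
naiveStem :: String -> String
naiveStem w = go ["ing", "ed", "es", "s"]
  where
    go [] = w
    go (suf : rest)
      | suf `isSuffixOf` w && length w - length suf >= 3 =
          take (length w - length suf) w
      | otherwise = go rest

-- | Build the dictionary from interesting terms, stored in stemmed form.
mkDictionary :: [String] -> Set.Set String
mkDictionary = Set.fromList . map (naiveStem . map toLower)

-- | Count how often each dictionary term (in stemmed form) occurs in the text.
keywordCounts :: Set.Set String -> String -> Map.Map String Int
keywordCounts dict text =
  Map.fromListWith (+)
    [ (s, 1) | t <- tokenize text, let s = naiveStem t, s `Set.member` dict ]
```

Because both the dictionary and the caption text pass through the same stemmer, "voted" and "voting" count towards the same entry. With a dictionary loaded from a database, only `mkDictionary` would need to change.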

bpgergo answered Nov 11 '22 12:11



To expand on bpgergo's answer (though I don't have any Haskell-specific information): it's pretty straightforward to load the documents into a relational database and index them with Solr/Lucene or Sphinx, either of which should include a stemmer in its default or suggested configuration. You can then search for the documents that contain pairs, triples, etc. of terms from your list of "interesting terms".
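Before setting up a search server, the pair search can be prototyped in memory. The sketch below is an assumption-laden illustration, not how Solr or Sphinx work internally: it skips stemming, expects the dictionary terms to be lowercase already, and the names `DocId`, `termsIn`, `pairs`, and `pairIndex` are invented for the example.

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | Hypothetical document identifier, e.g. a database primary key.
type DocId = Int

-- | Split text into lowercase alphabetic tokens.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- | Interesting terms present in a document, deduplicated and sorted.
-- Assumes the dictionary terms are already lowercase.
termsIn :: Set.Set String -> String -> [String]
termsIn dict = Set.toAscList . Set.intersection dict . Set.fromList . tokenize

-- | All unordered pairs of a list, each listed once.
pairs :: [a] -> [(a, a)]
pairs [] = []
pairs (x : xs) = [(x, y) | y <- xs] ++ pairs xs

-- | Map each co-occurring pair of terms to the documents containing both,
-- in the order the documents were supplied.
pairIndex :: Set.Set String -> [(DocId, String)] -> Map.Map (String, String) [DocId]
pairIndex dict docs =
  Map.fromListWith (flip (++))
    [ (p, [d]) | (d, text) <- docs, p <- pairs (termsIn dict text) ]
```

Triples follow the same pattern with a three-element combination function. For a large caption archive, a real index is still the better choice, since this version rescans every document.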

You might also look into named entity recognition, statistically unusual phrase detection, automatic tag generation, and similar topics. LingPipe is a good place to start, as are these books:

http://alias-i.com/lingpipe/demos/tutorial/read-me.html

http://www.manning.com/marmanis/excerpt_contents.html

http://www.manning.com/alag/excerpt_contents.html
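One simple way to approximate "statistically unusual" terms is TF-IDF scoring against a background corpus. This is a technique swapped in for illustration, not what LingPipe or the books above necessarily implement. The sketch assumes a list of background documents (for example, earlier captions from the same type of show) and uses a smoothed inverse document frequency, log((1 + n) / (1 + df)); the names `docFreq` and `tfidf` are invented here.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | Split text into lowercase alphabetic tokens.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- | Number of background documents in which each term appears.
docFreq :: [String] -> Map.Map String Int
docFreq docs =
  Map.fromListWith (+)
    [ (t, 1) | d <- docs, t <- Set.toList (Set.fromList (tokenize d)) ]

-- | Score each term of the target text by tf * idf, highest score first.
-- Terms that occur in every background document score zero.
tfidf :: [String] -> String -> [(String, Double)]
tfidf background target =
  sortBy (comparing (Down . snd))
    [ (t, fromIntegral tf * idf t) | (t, tf) <- Map.toList tfs ]
  where
    n = length background
    dfs = docFreq background
    tfs = Map.fromListWith (+) [ (t, 1 :: Int) | t <- tokenize target ]
    idf t =
      log (fromIntegral (1 + n) / fromIntegral (1 + Map.findWithDefault 0 t dfs))
```

Taking the top few results of `tfidf` gives candidate keywords without a hand-built dictionary. The two approaches can be combined by boosting the scores of terms that also appear in the dictionary.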

Gene T answered Nov 11 '22 12:11
