Extracting semantic/stylistic features from text

I would like to know of open-source tools (for Java/Python) which could help me extract semantic and stylistic features from text. Examples of semantic features would be the adjective-noun ratio, a particular sequence of part-of-speech tags (e.g. an adjective followed by a noun: adj|nn), etc. Examples of stylistic features would be the number of unique words, the number of pronouns, etc. Currently, I know only of Word to Web Tools, which converts a block of text into a rudimentary vector-space model.

I am aware of a few text-mining packages like GATE, NLTK, RapidMiner, Mallet and MinorThird. However, I couldn't find any mechanism in them that suits my task.
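To make the features concrete, here is a rough sketch of the kind of per-document numbers I would like to compute (written against NLTK's default tokenizer and tagger and assuming the Penn Treebank tagset; the feature definitions are only my own illustration):

```python
import nltk   # needs the 'punkt' tokenizer and a POS tagger model downloaded

def extract_features(text):
    """Toy semantic/stylistic feature extractor (Penn Treebank tagset)."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)                      # [(word, tag), ...]

    adjectives = [w for w, t in tagged if t.startswith('JJ')]
    nouns      = [w for w, t in tagged if t.startswith('NN')]
    pronouns   = [w for w, t in tagged if t in ('PRP', 'PRP$')]

    # "semantic" features
    adj_noun_ratio = len(adjectives) / float(len(nouns)) if nouns else 0.0
    adj_nn_pairs   = sum(1 for (_, t1), (_, t2) in zip(tagged, tagged[1:])
                         if t1.startswith('JJ') and t2.startswith('NN'))

    # "stylistic" features
    unique_words  = len(set(w.lower() for w in tokens if w.isalpha()))
    pronoun_count = len(pronouns)

    return {'adj_noun_ratio': adj_noun_ratio,
            'adj|nn_sequences': adj_nn_pairs,
            'unique_words': unique_words,
            'pronouns': pronoun_count}

print(extract_features("The quick brown fox jumps over the lazy dog while she watches."))
```

What I am looking for is a tool that does this kind of thing in a more principled, configurable way.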

Regards,
--Denzil

Dexter asked Jun 24 '10 12:06

1 Answer

I think that the Stanford Parser is one of the best and most comprehensive NLP tools available for free: not only will it allow you to parse the structural dependencies (to count nouns/adjectives), but it will also give you the grammatical dependencies in the sentence (so you can extract the subject, object, etc.). The latter is something that Python libraries simply cannot do yet (see Does NLTK have a tool for dependency parsing?) and is probably going to be the most important feature with regard to your software's ability to work with semantics.
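To give a rough idea of how little glue code is needed once the parser has run, here is an illustrative sketch of my own (it assumes you dumped the parser's Penn-tree and typed-dependency output to a text file; the exact command-line flags and model path depend on the parser release, so check its docs):

```python
import re

# Assumes the parser's output was saved from a command line roughly like:
#   java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser \
#        -outputFormat "penn,typedDependencies" englishPCFG.ser.gz input.txt > parsed.txt

ADJ_TAGS  = ('JJ', 'JJR', 'JJS')
NOUN_TAGS = ('NN', 'NNS', 'NNP', 'NNPS')

def count_pos(penn_tree_text):
    """Count adjective and noun leaves in a Penn-bracketed parse tree."""
    tags = re.findall(r'\((\S+) \S+\)', penn_tree_text)   # leaf nodes look like (TAG token)
    return (sum(1 for t in tags if t in ADJ_TAGS),
            sum(1 for t in tags if t in NOUN_TAGS))

def subjects_and_objects(dependency_text):
    """Pull subjects/objects from typed-dependency lines like 'nsubj(jumps-5, fox-4)'."""
    deps = re.findall(r'(\w+)\((\S+)-\d+, (\S+)-\d+\)', dependency_text)
    subjects = [dep for rel, gov, dep in deps if rel in ('nsubj', 'nsubjpass')]
    objects  = [dep for rel, gov, dep in deps if rel == 'dobj']
    return subjects, objects

tree = "(ROOT (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) (VP (VBZ jumps))))"
deps = "det(fox-4, The-1)\nnsubj(jumps-5, fox-4)"
print(count_pos(tree))               # -> (2, 1)
print(subjects_and_objects(deps))    # -> (['fox'], [])
```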

If you're interested in Java and Python tools, then Jython is probably the most fun for you to use. I was in the exact same boat, so I wrote this post about using Jython to run the example code provided with the Stanford Parser - give it a glance and see what you think: http://blog.gnucom.cc/2010/using-the-stanford-parser-with-jython/

Edit: After reading one of your comments I learned you need to parse 29 million sentences. I think you could benefit greatly from using pure Java to combine two really powerful technologies: the Stanford Parser and Hadoop. Both are written purely in Java and have extremely rich APIs that you can use to parse vast amounts of data in a fraction of the time on a cluster of machines. If you don't have the machines, you can use Amazon's EC2 cluster. If you need an example of using the Stanford Parser with Hadoop, leave a comment for me and I'll update the post with a URL to my example.
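If you want to prototype the pipeline before committing to the pure-Java route I'm recommending, Hadoop Streaming with a Python mapper is a rough stand-in; the script below is only a sketch of that alternative (using NLTK for tagging), not the Stanford Parser + Hadoop job itself:

```python
#!/usr/bin/env python
# Hadoop Streaming mapper: reads one sentence per line on stdin and emits
# tab-separated "feature<TAB>count" records on stdout; a trivial reducer
# would then sum the counts per feature key.
# A launch command would look roughly like (jar name/paths depend on your install):
#   hadoop jar hadoop-streaming.jar -input sentences/ -output features/ \
#       -mapper mapper.py -reducer reducer.py
import sys
import nltk

for line in sys.stdin:
    sentence = line.strip()
    if not sentence:
        continue
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    print("adjectives\t%d" % sum(1 for _, t in tagged if t.startswith('JJ')))
    print("nouns\t%d"      % sum(1 for _, t in tagged if t.startswith('NN')))
    print("pronouns\t%d"   % sum(1 for _, t in tagged if t in ('PRP', 'PRP$')))
```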

sholsapp answered Sep 17 '22 13:09