I've used LingPipe, Stanford's NER, RiTa and various sentence similarity libraries for my previous Java projects that focused on text (pre)processing (indexing, XML tagging, topic detection, etc.) of large amounts of English text (around 10,000 documents totaling more than 1 GB). Maybe I'm a bad Java programmer, but I find myself typing a lot of code and pulling in a lot of libraries whenever I switch to a different corpus. Overall, I feel like there might be a better tool for the job.
I guess my question is, will I benefit from switching to Python and NLTK for information retrieval / language processing? Or are there enough pros and cons to make it very subjective? Is NLTK intuitive enough to be learned quickly?
I'd get my hands dirty, but I won't have access to a personal machine for the next few days.
NLTK is good for natural language processing. I've used it for my data-mining project. You can train your own analyzer, and the learning curve is not steep.
NLTK comes with a large collection of corpora for training your analyzer. You can also provide your own data set, for example a journal that has been part-of-speech tagged.
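As a rough sketch of what training on your own tagged data might look like, here is a unigram part-of-speech tagger built from a tiny in-memory training set. The sentences and tags are made up for illustration (they stand in for your own tagged journal), and `UnigramTagger` with a `DefaultTagger` backoff is just one simple choice among NLTK's taggers, not necessarily the best one for a real corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-tagged sentences standing in for your own tagged corpus.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

# Words that were never seen in training fall back to the tag "NN".
backoff = DefaultTagger("NN")

# The unigram tagger learns the most frequent tag for each training word.
tagger = UnigramTagger(train_sents, backoff=backoff)

# Tag a new, already tokenized sentence; "loudly" is unseen and gets "NN".
tagged = tagger.tag(["the", "cat", "barks", "loudly"])
print(tagged)
```

With a real corpus you would replace `train_sents` with the tagged sentences from your own data and evaluate the tagger on held-out sentences before relying on it.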
Because Python is very good for text processing, you may want to give it a try. Plus, NLTK has an online tutorial.
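To give a feel for how little code basic text processing takes in Python, here is a minimal word-frequency count using only the standard library. The sample text and the regular expression used for tokenizing are assumptions for illustration; a real pipeline would likely use a proper tokenizer.

```python
import re
from collections import Counter

# Made-up sample text; in practice this would be read from your documents.
text = "The quick brown fox. The lazy dog. The end."

# Lowercase the text and pull out runs of letters and apostrophes as tokens.
tokens = re.findall(r"[a-z']+", text.lower())

# Count how often each token occurs.
counts = Counter(tokens)

# The single most frequent token with its count.
top = counts.most_common(1)
print(top)
```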
Remember to use a Python 2.x version, such as Python 2.6, since NLTK may not work well with Python 3.x. (This applies to NLTK 2.x; NLTK 3 and later support Python 3.)
If you already understand the basics of NLP, I think NLTK should be pretty easy to pick up. It's got a bunch of documentation, 2 books, and I've written a number of articles & tutorials on streamhacker.com. And if there's anything from the Java packages you don't want to lose, you could theoretically combine it with NLTK using Jython (and perhaps execnet).
You also may want to take a look at the Pattern library.