I have developed a few algorithms for clustering, data abstraction, etc. in Python with NLTK. Now the problem is that I am about to make it large scale before presenting to VCs. NLTK has its own advantages, like fast development, and that made sense to me when I chose it in the beginning. Now that I am more experienced, I find several limitations, like its lack of scalability. I did some research on Mahout, but that too is only for clustering/categorization and collocation. OpenNLP is an option, but I am not sure how far I can go with it. Is there anything good around for high-scale NLP?
Please note: this question is not related to my older question, "How can I improve the performance of NLTK? alternatives?". I have already read "NLTK on a production web application" in full.
NLTK is indeed a good learning platform, but it is not designed to robustly serve millions of customers.
You can approach your scalability issues in two different ways: throw more hardware at the problem, or rewrite your algorithms.
The second option means rethinking your algorithms, which requires a good mathematical background and a sound understanding of how they work. You may even end up replacing algorithms entirely, because execution time is driven less by the raw amount of work than by the algorithm's complexity.
In terms of implementation effort, this may be the most difficult (and maybe even impossible) solution, depending on your skills. For deployment and future benefit, though, it is by far the easiest solution.
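As a hypothetical illustration of what "replacing an algorithm" can buy you (all function names here are made up for the example): grouping duplicate documents by comparing every pair is O(n²), while bucketing documents by a hash of their normalized tokens does the same job in roughly O(n).

```python
from collections import defaultdict

def group_duplicates_quadratic(docs):
    """O(n^2): compare each document against the first member of every group."""
    groups = []
    for doc in docs:
        for group in groups:
            if doc.lower().split() == group[0].lower().split():
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups

def group_duplicates_hashed(docs):
    """Roughly O(n): bucket documents by a hash of their normalized tokens."""
    buckets = defaultdict(list)
    for doc in docs:
        key = hash(tuple(doc.lower().split()))
        buckets[key].append(doc)
    return list(buckets.values())
```

On a million documents, the second version is the difference between minutes and days, and no amount of hardware makes up that gap.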
"Scalability" can mean different things: handling more data, more users, or more requests per second.
There are different orders of magnitude concerning scalability: do you want to scale 10-fold, 100-fold, 1000-fold, ...?
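The order of magnitude matters because it drives the capacity plan. A back-of-envelope sketch (the function and its parameters are illustrative, and it assumes, optimistically, near-linear horizontal scaling):

```python
import math

def machines_needed(current_rps, target_multiple, per_machine_rps):
    """Estimate machines required to serve target_multiple x the current load,
    assuming (hypothetically) that capacity scales linearly with machine count."""
    target_rps = current_rps * target_multiple
    return math.ceil(target_rps / per_machine_rps)

# e.g. 50 req/s today, aiming for 100x growth, 200 req/s per machine:
machines_needed(50, 100, 200)
```

A 10-fold target might be reachable on a single bigger machine; a 1000-fold target almost certainly forces architectural changes, not just more boxes.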
There are different methods to overcome scalability issues: vertical scaling (bigger machines), horizontal scaling (more machines), caching, and so on.
Whatever the type of scalability, and whatever method you use to achieve it, run a load test to see what you can handle. Since you can't afford all of your hardware up front, there are different ways to load-test a scaled infrastructure, such as testing a scaled-down copy and extrapolating the results.
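A load test can start as a small script. The sketch below (all names are hypothetical) hammers a stand-in request handler with concurrent calls and reports throughput and worst-case latency; in a real test, `handle_request` would call your deployed service instead of a local function.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(text):
    # Stand-in for the real NLP pipeline (e.g. an NLTK-based tokenizer).
    return len(text.split())

def load_test(requests, concurrency=8):
    """Fire requests concurrently; return throughput and worst-case latency."""
    def timed(req):
        start = time.perf_counter()
        handle_request(req)
        return time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, requests))
    elapsed = time.perf_counter() - start
    return {
        "throughput_rps": len(requests) / elapsed,
        "max_latency_s": max(latencies),
    }
```

Run it against a single node, then extrapolate: if one node handles N requests per second and you need 100N, you know the shape of the problem before buying hardware.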
Good luck!