 

NLTK in a production environment?

I have developed a few algorithms for clustering, data abstraction, etc. in Python with NLTK. The problem now is that I am about to take them to large scale before presenting to VCs. NLTK has its advantages, such as fast development, and that made sense when I chose it at the start. But the project has matured, and I am now running into several of its limitations, most notably its lack of scalability. I did some research on Mahout, but it too only covers clustering/categorization and collocations. OpenNLP is an option, but I am not sure how far I can go with it. Is there anything good around for large-scale NLP?

Please note: this question is not related to my older question, "How can I improve the performance of NLTK? alternatives?". I have already read "NLTK on a production web application" in full.

asked Apr 03 '13 by akshayb

1 Answer

NLTK is indeed a good learning platform, but it is not designed to robustly serve millions of customers.

You can approach your scalability issues in two different ways:

  • The first, "big data", approach: adapt your algorithms to MapReduce and run them on MongoDB/Hadoop/Google MapReduce/... There are various places to host such solutions (Amazon, Google, Rackspace, ...).
  • The second, "roll your own", approach: work with common hosting solutions or your own data center.

The "big data" approach

This means rethinking your algorithms. It requires a good mathematical background and a sound understanding of the algorithms involved. You may even end up replacing some algorithms, because on a cluster execution time depends less on the total amount of work and more on how well that work can be split across machines.

So in terms of implementing your idea, this may be the most difficult (and maybe even impossible) solution, depending on your skills. For deployment and future benefits, it is by far the easiest solution.
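To make the MapReduce idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer simply read stdin and write tab-separated key/value pairs. The whitespace tokenization is a placeholder for your own preprocessing:

```python
#!/usr/bin/env python
# mapper.py -- emits one "token<TAB>1" pair per token; Hadoop
# groups the pairs by key before they reach the reducer.
import sys

for line in sys.stdin:
    for token in line.lower().split():
        print("%s\t%d" % (token, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- receives pairs sorted by key and sums the counts
# for each token.
import sys

current_token, current_count = None, 0
for line in sys.stdin:
    token, count = line.rstrip("\n").split("\t", 1)
    if token != current_token:
        if current_token is not None:
            print("%s\t%d" % (current_token, current_count))
        current_token, current_count = token, 0
    current_count += int(count)
if current_token is not None:
    print("%s\t%d" % (current_token, current_count))
```

You would launch these with something like `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py -reducer reducer.py`. The point is that each process only ever sees a slice of the data, so the job scales by adding machines rather than by optimizing a single one.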

The "roll your own" approach

Scalability can mean different things:

  • larger training sets
  • more customers
  • more algorithms and applications
  • growing your training sets, which can mean either retraining from scratch or adapting incrementally
  • ...

There are different orders of magnitude concerning scalability: do you want to scale 10-fold, 100-fold, 1000-fold, ...?

There are different methods to overcome scalability issues:

  • Parallelize: add exact copies of a server and do load balancing
  • Pipeline: split processing into steps that can run on different servers
  • More expensive hardware: faster CPUs, more RAM, faster disks, buses, ASICs, ...
  • Client-side processing
  • Caching of requests (see the sketch after this list)
  • Performance tuning of your software: reimplement the bottlenecks in C/C++
  • Use better algorithms
  • Smarter separation of what happens offline (e.g. in a cron job) and what is done per request
  • ...
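To illustrate the caching point, here is a minimal sketch that memoizes an expensive per-request NLP call with Python 3's functools.lru_cache. The use of NLTK's pos_tag is purely illustrative; substitute your own pipeline:

```python
# A minimal request-caching sketch. Requires the NLTK data packages
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
from functools import lru_cache

import nltk

@lru_cache(maxsize=100000)
def tag(sentence):
    # Identical sentences are processed once; repeats are served from
    # memory, and the least recently used entries are evicted.
    return tuple(nltk.pos_tag(nltk.word_tokenize(sentence)))

print(tag("The cat sat on the mat."))
print(tag("The cat sat on the mat."))  # served from the cache
```

This only pays off if requests actually repeat; for mostly unique inputs, cache at a coarser level (e.g. per document) or not at all.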

Whatever type of scalability you need and whatever method you use to achieve it, do a load test to see what you can handle (a rough sketch follows the list below). Since you can't afford all your hardware instantly, there are different ways to do a load test for a scaled infrastructure:

  • rent processors, memory, disk space, ... per hour, just enough to do the load test, and then bail out. That way, you don't need to buy equipment.
  • more risky: do a load test on less and cheaper equipment than will be in production and extrapolate the results. Maybe you have a theoretical model of how your algorithms scale, but beware of side effects. The proof of the pudding is in the eating.
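Here is a rough load-test sketch using only the Python standard library. The endpoint URL, concurrency level, and request count are placeholders to adapt to your setup:

```python
# Fires concurrent requests at an endpoint and reports latency
# percentiles. The URL and volumes below are hypothetical.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/analyze?q=test"  # placeholder endpoint

def one_request(_):
    start = time.time()
    urllib.request.urlopen(URL, timeout=10).read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(one_request, range(1000)))

print("median: %.3fs  p95: %.3fs" % (
    latencies[len(latencies) // 2],
    latencies[int(len(latencies) * 0.95)]))
```

Watch the tail latencies (p95/p99) rather than the average; scaling problems usually show up there first.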

Approaching VCs (as far as scalability is concerned)

  • Create a prototype that clearly demonstrates your idea (it does not have to be scalable)
  • Prove to yourself that everything will be OK at some point in the future, and at what cost (min/expected/max, one-time/recurring)
  • Start with a private beta, so that scalability is not an issue right from the start. There is no deadline to go out of beta. An estimate is OK, but no deadline. Don't compromise on that!

Good luck!

answered Sep 29 '22 by pvoosten