 

NLTK in a production environment?

I have developed a few algorithms for clustering, data abstraction, etc. in Python with NLTK. The problem now is that I am about to take them to large scale before presenting to VCs. NLTK has its advantages, such as fast development, and that made sense when I chose it at the start. But the project has matured, and I am now running into several of its limitations, most notably its lack of scalability. I did some research on Mahout, but it too only covers clustering/categorization and collocations. OpenNLP is an option, but I am not sure how far I can go with it. Is there anything good around for large-scale NLP?

Please note: this question is not related to my older question, "How can I improve the performance of NLTK? alternatives?". I have already read "NLTK on a production web application" in full.

asked Apr 03 '13 by akshayb

1 Answer

NLTK is indeed a good learning platform, but it is not designed to robustly serve millions of customers.

You can approach your scalability issues in two different ways:

  • The first, "big data", approach: adapt your algorithms to MapReduce and run them on MongoDB/Hadoop/Google MapReduce/... There are various places to host such solutions (Amazon, Google, Rackspace, ...).
  • The second, "roll your own", approach: work with common hosting solutions or your own data center.

The "big data" approach

This means rethinking your algorithms. It requires a good mathematical background and a sound understanding of the algorithms involved. You may even end up replacing some algorithms, because on a cluster execution time depends less on the total amount of work and more on how well that work can be split across machines.

So in terms of implementing your idea, this may be the most difficult (and maybe even impossible) solution, depending on your skills. For deployment and future benefits, it is by far the easiest solution.
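To make the MapReduce idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer simply read stdin and write tab-separated key/value pairs. The whitespace tokenization is a placeholder for your own preprocessing:

```python
#!/usr/bin/env python
# mapper.py -- emits one "token<TAB>1" pair per token; Hadoop
# groups the pairs by key before they reach the reducer.
import sys

for line in sys.stdin:
    for token in line.lower().split():
        print("%s\t%d" % (token, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- receives pairs sorted by key and sums the counts
# for each token.
import sys

current_token, current_count = None, 0
for line in sys.stdin:
    token, count = line.rstrip("\n").split("\t", 1)
    if token != current_token:
        if current_token is not None:
            print("%s\t%d" % (current_token, current_count))
        current_token, current_count = token, 0
    current_count += int(count)
if current_token is not None:
    print("%s\t%d" % (current_token, current_count))
```

You would launch these with something like `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py -reducer reducer.py`. The point is that each process only ever sees a slice of the data, so the job scales by adding machines rather than by optimizing a single one.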

The "roll your own" approach

Scalability can mean different things:

  • larger training sets
  • more customers
  • more algorithms and applications
  • growing your training sets, which can mean either retraining from scratch or adapting incrementally
  • ...

There are different orders of magnitude concerning scalability: do you want to scale 10-fold, 100-fold, 1000-fold, ...?

There are different methods to overcome scalability issues:

  • Parallelize: add exact copies of a server and do load balancing
  • Pipeline: split processing into steps that can run on different servers
  • More expensive hardware: faster CPUs, more RAM, faster disks, buses, ASICs, ...
  • Client-side processing
  • Caching of requests (see the sketch after this list)
  • Performance tuning of your software: reimplement the bottlenecks in C/C++
  • Use better algorithms
  • Smarter separation of what happens offline (e.g. in a cron job) and what is done per request
  • ...
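To illustrate the caching point, here is a minimal sketch that memoizes an expensive per-request NLP call with Python 3's functools.lru_cache. The use of NLTK's pos_tag is purely illustrative; substitute your own pipeline:

```python
# A minimal request-caching sketch. Requires the NLTK data packages
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
from functools import lru_cache

import nltk

@lru_cache(maxsize=100000)
def tag(sentence):
    # Identical sentences are processed once; repeats are served from
    # memory, and the least recently used entries are evicted.
    return tuple(nltk.pos_tag(nltk.word_tokenize(sentence)))

print(tag("The cat sat on the mat."))
print(tag("The cat sat on the mat."))  # served from the cache
```

This only pays off if requests actually repeat; for mostly unique inputs, cache at a coarser level (e.g. per document) or not at all.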

Whatever type of scalability you need and whatever method you use to achieve it, do a load test to see what you can handle (a rough sketch follows the list below). Since you can't afford all your hardware instantly, there are different ways to do a load test for a scaled infrastructure:

  • rent processors, memory, disk space, ... per hour, just enough to do the load test, and then bail out. That way, you don't need to buy equipment.
  • more risky: do a load test on less and cheaper equipment than will be in production and extrapolate the results. Maybe you have a theoretical model of how your algorithms scale, but beware of side effects. The proof of the pudding is in the eating.
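Here is a rough load-test sketch using only the Python standard library. The endpoint URL, concurrency level, and request count are placeholders to adapt to your setup:

```python
# Fires concurrent requests at an endpoint and reports latency
# percentiles. The URL and volumes below are hypothetical.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/analyze?q=test"  # placeholder endpoint

def one_request(_):
    start = time.time()
    urllib.request.urlopen(URL, timeout=10).read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(one_request, range(1000)))

print("median: %.3fs  p95: %.3fs" % (
    latencies[len(latencies) // 2],
    latencies[int(len(latencies) * 0.95)]))
```

Watch the tail latencies (p95/p99) rather than the average; scaling problems usually show up there first.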

Approaching VCs (as far as scalability is concerned)

  • Create a prototype that clearly demonstrates your idea (it does not have to be scalable)
  • Prove to yourself that everything will be OK at some point in the future, and at what cost (min/expected/max, one-time/recurring)
  • Start with a private beta, so that scalability is not an issue right from the start. There is no deadline to go out of beta. An estimate is OK, but no deadline. Don't compromise on that!

Good luck!

answered Sep 29 '22 by pvoosten