
NLTK on a production web application

I'd like to incorporate a custom tagger into a web application (running on Pyramid) I'm developing. I have the tagger working fine on my local machine using NLTK, but I've read that NLTK is relatively slow for production.

It seems that the standard way of storing the tagger is to pickle it. On my machine, it takes a few seconds to load the 11.7MB pickle file.
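For reference, the pickle round-trip looks roughly like this (a minimal sketch; `DummyTagger` is a stand-in for the real trained NLTK tagger, but the pickle calls are identical):

```python
import pickle

class DummyTagger:
    """Stand-in for a trained NLTK tagger (e.g. a UnigramTagger)."""
    def tag(self, tokens):
        return [(tok, "NN") for tok in tokens]

# Save once, after training:
with open("tagger.pickle", "wb") as f:
    pickle.dump(DummyTagger(), f)

# Load at application start-up -- this is the step that takes a few
# seconds for a real 11.7MB model:
with open("tagger.pickle", "rb") as f:
    tagger = pickle.load(f)
```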

  1. Is NLTK even practical for production? Should I be looking at scikit-learn or even something like Mahout?

  2. If NLTK is good enough, what is the best way to manage its memory use?

asked Jan 02 '13 by abroekhof


2 Answers

I run text-processing and its associated NLP APIs, and it uses about two dozen different pickled models, which are loaded by a Django app (gunicorn behind nginx). The models are loaded as soon as they are needed, and once loaded, they stay in memory. That means whenever I restart the gunicorn server, the first requests that need a model have to wait a few seconds for it to load, but every subsequent request gets to use the model that's already cached in RAM. Restarts only happen when I deploy new features, which usually involves updating the models, so I'd need to reload them anyway. So if you don't expect to make code changes very often, and don't have strong requirements on consistent request times, then you probably don't need a separate daemon.
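The load-on-first-use-and-cache pattern described above can be sketched as a module-level cache (the helper name and file path are hypothetical; the real cached objects would be unpickled NLTK models):

```python
import pickle
import threading

_models = {}
_lock = threading.Lock()

def get_model(path):
    """Load a pickled model the first time it's requested, then cache it.

    Later requests (from any thread in the worker) reuse the in-memory
    object, so only the first request pays the multi-second load cost.
    """
    with _lock:
        if path not in _models:
            with open(path, "rb") as f:
                _models[path] = pickle.load(f)
        return _models[path]
```

Restarting the worker clears the cache, which matches the behaviour above: the first request after a deploy is slow, and every request after that is fast.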

Other than the initial load time, the main limiting factor is memory. I currently only have 1 worker process, because when all the models are loaded into memory, a single process can take up to 1GB (YMMV, and for a single 11MB pickle file, your memory requirements will be much lower). Processing an individual request with an already loaded model is fast enough (usually <50ms) that I currently don't need more than 1 worker, and if I did, the simplest solution is to add enough RAM to run more worker processes.

If you are worried about memory, then do look into scikit-learn, since equivalent models can use significantly less memory than with NLTK. But they are not necessarily faster or more accurate.

answered Nov 03 '22 by Jacob


The best way to reduce start-up latency is to run the tagger as a daemon (persistent service) to which your web app sends snippets of text for tagging. That way the tagger loads only when the system boots up and if/when the daemon needs to be restarted.
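A minimal sketch of that daemon approach, using the standard library's XML-RPC server (the `DummyTagger` stand-in is a placeholder; the real daemon would unpickle the trained tagger once at start-up):

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

class DummyTagger:
    """Stand-in for the unpickled NLTK tagger."""
    def tag(self, tokens):
        return [(tok, "NN") for tok in tokens]

# Load the model once, when the daemon starts:
tagger = DummyTagger()  # real daemon: tagger = pickle.load(open(path, "rb"))

# Serve tag requests; port 0 lets the OS pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda text: tagger.tag(text.split()), "tag")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The web app then calls the daemon instead of loading the tagger itself:
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.tag("a quick test")  # tuples come back as lists over XML-RPC
```

In practice the server would run as its own process, started at boot, and each web worker would hold a `ServerProxy` pointing at it.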

Only you can decide if NLTK is fast enough for your needs. Once the tagger is loaded, you've probably noticed that NLTK can tag several pages of text without perceptible delay. But resource consumption and the number of concurrent users could complicate things.

answered Nov 03 '22 by alexis