I'd like to incorporate a custom tagger into a web application (running on Pyramid) I'm developing. I have the tagger working fine on my local machine using NLTK, but I've read that NLTK is relatively slow for production.
It seems that the standard way of storing the tagger is to pickle it. On my machine, it takes a few seconds to load the 11.7MB pickle file.
Is NLTK even practical for production? Should I be looking at scikit-learn or even something like Mahout?
If NLTK is good enough, what is the best way to ensure that it properly uses memory, etc.?
I run text-processing and its associated NLP APIs, and it uses about two dozen different pickled models, which are loaded by a Django app (gunicorn behind nginx). The models are loaded as soon as they are needed, and once loaded, they stay in memory. That means whenever I restart the gunicorn server, the first requests that need a model have to wait a few seconds for it to load, but every subsequent request gets to use the model that's already cached in RAM. Restarts only happen when I deploy new features, which usually involves updating the models, so I'd need to reload them anyway. So if you don't expect to make code changes very often, and don't have strong requirements on consistent request times, then you probably don't need a separate daemon.
Other than the initial load time, the main limiting factor is memory. I currently only have 1 worker process, because when all the models are loaded into memory, a single process can take up to 1GB (YMMV, and for a single 11MB pickle file, your memory requirements will be much lower). Processing an individual request with an already loaded model is fast enough (usually <50ms) that I currently don't need more than 1 worker, and if I did, the simplest solution is to add enough RAM to run more worker processes.
If you are worried about memory, then do look into scikit-learn, since equivalent models can use significantly less memory than NLTK. But they are not necessarily faster or more accurate.
The best way to reduce start-up latency is to run the tagger as a daemon (persistent service) that your web app sends snippets of text to tag. That way your tagger loads only when the system boots up and if/when the daemon needs to be restarted.
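As a rough illustration of the daemon idea, here is a minimal sketch using only the standard library's `multiprocessing.connection`. The names `serve` and `tag_remote` are hypothetical, the one-request-per-connection protocol is an assumption made for brevity, and a real service would also need error handling and a way to shut down.

```python
from multiprocessing.connection import Client, Listener


def serve(listener, tag_fn):
    """Daemon side: answer one tagging request per connection, forever.

    `tag_fn` is whatever callable does the tagging; the expensive model
    load happens once, before this loop starts, not per request.
    """
    while True:
        with listener.accept() as conn:
            tokens = conn.recv()        # a list of token strings
            conn.send(tag_fn(tokens))   # e.g. a list of (token, tag) pairs


def tag_remote(address, tokens, authkey):
    """Web-app side: send tokens to the daemon and return its reply."""
    with Client(address, authkey=authkey) as conn:
        conn.send(tokens)
        return conn.recv()
```

The daemon process would unpickle the tagger once at start-up, create a `Listener(address, authkey=...)`, and call `serve(listener, tagger.tag)`; the web app then calls `tag_remote` with the same address and key. Because this transport pickles its messages, it should only be exposed to trusted local clients.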
Only you can decide if NLTK is fast enough for your needs. Once the tagger is loaded, you've probably noticed that NLTK can tag several pages of text without perceptible delay. But resource consumption and the number of concurrent users could complicate things.