I am building a Python-based web service that provides natural language processing support to our main app's API. Since it is so NLP-heavy, it needs to unpickle a few very large (50-300 MB) corpus files from disk before it can do any kind of analysis.
How can I load these files into memory so that they are available to every request? I experimented with memcached and Redis, but they seem designed for much smaller objects. I have also tried Flask's g object, but that only persists for a single request.
Is there any way to do this while using a gevent (or other) server to allow concurrent connections? The corpora are completely read-only so there ought to be a safe way to expose the memory to multiple greenlets/threads/processes.
Thanks so much and sorry if it's a stupid question - I've been working with python for quite a while but I'm relatively new to web programming.
If you are using gevent, you can keep your read-only data structures in the global scope of your process and they will be shared by all the greenlets. With gevent your server runs in a single process, so the data can be loaded once and shared among all the worker greenlets.
A good way to encapsulate access to the data is to put the access function(s) or class(es) in a module. You can do the unpickling of the data when the module is imported, or you can trigger that task the first time someone calls a function in the module.
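As a minimal sketch of the import-time approach, assuming the corpora are plain pickle files on disk (the module name, CORPUS_DIR, and the file names below are placeholders for whatever you actually use):

```python
# corpus.py -- hypothetical module; CORPUS_DIR and the .pkl names are
# placeholders for wherever your pickled corpora actually live.
import os
import pickle

CORPUS_DIR = os.environ.get("CORPUS_DIR", "/srv/nlp/corpora")

def _load(name):
    """Unpickle one corpus file from disk."""
    with open(os.path.join(CORPUS_DIR, name), "rb") as f:
        return pickle.load(f)

# Loaded once, when the module is first imported, then shared
# by every greenlet in the process.
LEXICON = _load("lexicon.pkl")
NGRAMS = _load("ngrams.pkl")
```

Any request handler can then just `from corpus import LEXICON` and read it directly; the import only happens once per process.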
You will need to make sure there is no possibility of introducing a race condition, but if the data is strictly read-only you should be fine.
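If you prefer to defer loading until the first request, the only write that ever happens is filling the cache, so guarding that one step is enough. A possible sketch of the lazy variant (again with hypothetical names; under gevent's monkey patching the threading lock behaves as a greenlet-aware lock):

```python
# corpus_lazy.py -- lazy-loading variant (hypothetical names).
import os
import pickle
import threading

CORPUS_DIR = os.environ.get("CORPUS_DIR", "/srv/nlp/corpora")

_cache = {}
_lock = threading.Lock()

def get_corpus(name):
    """Return the unpickled corpus, loading it on first request only."""
    if name not in _cache:
        with _lock:
            # Re-check inside the lock: another greenlet may have
            # loaded the corpus while we were waiting.
            if name not in _cache:
                path = os.path.join(CORPUS_DIR, name + ".pkl")
                with open(path, "rb") as f:
                    _cache[name] = pickle.load(f)
    return _cache[name]
```

After the first call, every subsequent `get_corpus("lexicon")` is just a dictionary lookup, and since nothing ever mutates the loaded objects, concurrent reads are safe.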