How to avoid computation every time a python module is reloaded

Tags: python, nltk

I have a Python module that uses a huge dictionary as a global variable. Currently the code that computes it sits at the top of the module, so every first import or reload of the module takes more than one minute, which is totally unacceptable. How can I save the result somewhere so that the next import/reload doesn't have to compute it again? I tried cPickle, but loading the dictionary from a file (1.3M) takes approximately as long as the computation.

To give more information about my problem:

FD = FreqDist(word for word in brown.words()) # this line of code takes 1 min
btw0 asked Oct 12 '08 15:10



2 Answers

Just to clarify: the code in the body of a module is not executed every time the module is imported. It runs only once; later imports find the already-created module rather than recreating it. (An explicit reload() does run the body again.) Take a look at sys.modules to see the dict of cached modules.
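To see the caching in action, here is a small sketch for Python 3. It writes a throwaway module to a temporary directory (the name reload_demo_mod and its BIG_DICT variable are made up for this illustration), imports it twice, and then reloads it explicitly:

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module (hypothetical name) whose body builds a fresh dict.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "reload_demo_mod.py"), "w") as f:
    f.write("BIG_DICT = {'built': True}\n")
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import reload_demo_mod
original = reload_demo_mod.BIG_DICT

# Second import: served from sys.modules, so the module body is not re-run.
import reload_demo_mod
same_after_import = reload_demo_mod.BIG_DICT is original

# Explicit reload: the module body runs again and rebuilds the dict.
importlib.reload(reload_demo_mod)
same_after_reload = reload_demo_mod.BIG_DICT is original

cached = "reload_demo_mod" in sys.modules
print(same_after_import, same_after_reload, cached)
```

So a plain repeated import is cheap, while a reload pays the full computation cost again.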

However, if your problem is the time it takes for the first import after the program starts, you'll probably need something other than a plain Python dict. The best option is probably an on-disk form, for instance a sqlite database or one of the dbm modules.
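As a rough sketch of the sqlite route (Python 3, standard-library sqlite3 only): the table name counts, its columns, the file name and the sample rows below are all invented for illustration, standing in for the real word frequencies.

```python
import os
import sqlite3
import tempfile

# Hypothetical database location for this demo.
db_path = os.path.join(tempfile.mkdtemp(), "counts.sqlite")

# One-off build step: store word -> count pairs on disk.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)")
conn.executemany(
    "INSERT OR REPLACE INTO counts VALUES (?, ?)",
    [("the", 3), ("fox", 1), ("dog", 1)],  # stand-in data
)
conn.commit()
conn.close()


def lookup(path, word):
    """Fetch a single count from disk; unknown words count as 0."""
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT n FROM counts WHERE word = ?", (word,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else 0


the_n = lookup(db_path, "the")
missing_n = lookup(db_path, "cat")
print(the_n, missing_n)
```

Only the requested row is read, so start-up cost no longer depends on the size of the whole table. A real module would probably keep one connection open instead of reconnecting per lookup.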

For a minimal change in your interface, the shelve module may be your best option. It puts a fairly transparent interface over the dbm modules that makes them act like an ordinary Python dict with string keys, allowing any picklable value to be stored. Here's an example:

# Create dict with a million items:
import shelve
d = shelve.open('path/to/my_persistant_dict')
d.update(('key%d' % x, x) for x in range(1000000))  # use xrange on Python 2
d.close()

Then in the next process, use it. There should be no large delay, as lookups are only performed for the key requested on the on-disk form, so everything doesn't have to get loaded into memory:

>>> import shelve
>>> d = shelve.open('path/to/my_persistant_dict')
>>> print(d['key99999'])
99999

It's a bit slower than a real dict, and it will still take a long time to load if you do something that requires all the keys (e.g. printing it), but it may solve your problem.
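Applied to the question, one possible arrangement is to fill the shelf only when it is still empty and reuse it on every later run. This is a sketch for Python 3, not a definitive implementation: expensive_counts is a hypothetical stand-in for the slow FreqDist call, and the "__built__" marker key and the file name are assumptions made up here.

```python
import os
import shelve
import tempfile
from collections import Counter

build_calls = 0  # only here to show how often the slow step runs


def expensive_counts():
    """Hypothetical stand-in for the slow FreqDist computation."""
    global build_calls
    build_calls += 1
    words = "the quick brown fox jumps over the lazy dog the end".split()
    return Counter(words)


def open_counts(path):
    """Open the shelf at path, filling it on first use only."""
    shelf = shelve.open(path)
    if "__built__" not in shelf:  # marker key, an assumption of this sketch
        shelf.update(expensive_counts())
        shelf["__built__"] = True
    return shelf


shelf_path = os.path.join(tempfile.mkdtemp(), "word_counts")

first = open_counts(shelf_path)    # computes and stores the counts
the_count = first["the"]
first.close()

second = open_counts(shelf_path)   # finds the marker, skips the computation
the_count_again = second["the"]
second.close()

print(the_count, the_count_again, build_calls)
```

A marker key is used instead of checking for the file on disk because the dbm backends behind shelve may add their own file extensions.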

Brian answered Nov 14 '22 10:11



Calculate your global var on the first use.

class Proxy:
    @property
    def global_name(self):
        # calculate your global var here, enable cache if needed
        ...

_proxy_object = Proxy()
# Note: this attribute access runs the property (and the computation) right here, at import time.
GLOBAL_NAME = _proxy_object.global_name

Or better yet, access the necessary data via a special data object, so that the value is computed only when the attribute is first read.

class Data:
    GLOBAL_NAME = property(...)

data = Data()

Example:

from some_module import data

print(data.GLOBAL_NAME)

See Django's settings object (django.conf.settings) for a real-world use of this pattern.
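A minimal runnable sketch of such a data object, for Python 3. The caching attribute _global_name, the compute_calls counter and the Counter stand-in for the slow computation are all assumptions added for illustration:

```python
from collections import Counter


class Data:
    """Holds expensive values and computes each one on first access."""

    def __init__(self):
        self._global_name = None
        self.compute_calls = 0  # only here to show when the work happens

    @property
    def GLOBAL_NAME(self):
        if self._global_name is None:
            self.compute_calls += 1
            # Stand-in for the slow computation (e.g. building the FreqDist).
            self._global_name = Counter("a b a c a".split())
        return self._global_name


data = Data()                       # cheap: nothing is computed yet
calls_before = data.compute_calls

first = data.GLOBAL_NAME            # first access triggers the computation
second = data.GLOBAL_NAME           # later accesses reuse the cached value

print(calls_before, data.compute_calls, first is second)
```

Importing the module that defines data stays fast, and only code that actually reads data.GLOBAL_NAME pays for the computation, once per process.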

jfs answered Nov 14 '22 09:11
