I'm trying to load a couple of files into the memory. The files have either of the following 3 formats:
Indeed, they are ngram statics files, in case this helps with the solution. For instance:
i_love TAB 10
love_you TAB 12
Currently, the pseudocode of I'm doing right now is
loadData(file):
data = {}
for line in file:
first, second = line.split('\t')
data[first] = int(second) #or float(second)
return data
To much of my surprise, while the total size of the files in disk is about 21 mb, when loaded into memory the process takes 120 - 180 mb of memory! (the whole python application doesn't load any other data into memory).
There are less than 10 files, most of them would stay stable at about 50-80k lines, except for one file which currently has millions of lines.
So I would like to ask for a technique/data structure to reduce the memory consumption:
Thank you very much. I look forward to your advice.
The Python dictionary implementation consumes a surprisingly small amount of memory. But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first.
This sums up to at least 12 bytes on a 32bit machine and 24 bytes on a 64bit machine. The dictionary starts with 8 empty buckets. This is then resized by doubling the number of entries whenever its capacity is reached.
The dictionary is stored in memory in the form of a hash table with an ordered array of ranges and their corresponding values. The dictionary key has the UInt64 type. This storage method works the same way as hashed and allows using date/time (arbitrary numeric type) ranges in addition to the key.
I cannot offer a complete strategy that would help improve memory footprint, but I believe it may help to analyse what exactly is taking so much memory.
If you look at the Python implementation of dictionary (which is a relatively straight-forward implementation of a hash table), as well as the implementation of the built-in string and integer data types, for example here (specifically object.h, intobject.h, stringobject.h and dictobject.h, as well as the corresponding *.c files in ../Objects), you can calculate with some accuracy the expected space requirements:
An integer is a fixed-sized object, i.e. it contains a reference count, a type pointer and the actual integer, in total typically at least 12 bytes on a 32bit system and 24 bytes on a 64bit system, not taking into account extra space possibly lost through alignment.
A string object is variable-sized, which means it contains
reference count
type pointer
size information
space for the lazily calculated hash code
state information (e.g. used for interned strings)
a pointer to the dynamic content
in total at least 24 bytes on 32bit or 60 bytes on 64bit, not including space for the string itself.
the hash code of the object currently stored (that is not predictable from the position of the bucket due to the collision resolution strategy used)
a pointer to the key object
a pointer to the value object
in total at least 12 bytes on 32bit and 24 bytes on 64bit.
I carried out a test with a list of 46,461 unique strings (337,670 bytes concatenated string size), each associated with an integer — similar to your setup, on a 32-bit machine. According to the calculation above, I would expect a minimum memory footprint of
in total 2.65 MB. (For a 64-bit system the corresponding calculation yields 5.5 MB.)
When running the Python interpreter idle, its footprint according to the ps
-tool is 4.6 MB. So the total expected memory consumption after creating the dictionary is approximately 4.6 + 2.65 = 7.25 MB. The true memory footprint (according to ps
) in my test was 7.6 MB. I guess the extra ca. 0.35 MB were consumed by overhead generated through Python's memory allocation strategy (for memory arenas etc.)
Of course many people will now point out that my use of ps
to measure the memory footprint is inaccurate and my assumptions about the size of pointer types and integers on 32-bit and 64-bit systems may be wrong on many specific systems. Granted.
But, nevertheless, the key conclusions, I believe, are these:
1) SQLite in memory sounds like a great solution, it'll let you query your data more easily once it's loaded which is a pleasure
sqlite3.connect(':memory:')
2) you probably want a named tuple - I'm pretty sure they're lighter than dictionaries and you can access properties using dot notation (for which I have an aesthetic preference anyway).
http://docs.python.org/dev/library/collections
3) you may want to have a look at Redis: https://github.com/andymccurdy/redis-py
It is FAST and will let you persist things easily, meaning you don't have to load the whole set in every time you want to use it.
4) a trie sounds like a good idea, but adds some theoretical complexity on your end of the work. You can use Redis to implement and store it though, which will boost your speed even further.
But overall, named tuples are probably the trick here.
In disk you have just strings, when loading to Python the interpreter has to create a whole structure for each string and for each dictionary, besides the string itself.
There's no way to reduce the memory used by the dicts, but there are other ways to approach the problem. If you're willing to trade some speed for memory, you should consider storing and querying the strings from an SQLite file instead of loading everything to dictionaries in memory.
Sounds like a Trie ( http://en.m.wikipedia.org/wiki/Trie) data structure might better suit your desire for memory efficiency.
Update: the memory efficiency of python dict has been raised as an issue, though it was rejected from the standard lib given the availability of third party libraries. See: http://bugs.python.org/issue9520
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With