 

Python: Effective storage of data in memory

I have the following dictionary structure ( 10,000 keys with values being formed by list of lists )

my_dic = {0: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ... (10,000th value) [3, .15, 2, 1, 2.5]],
          1: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ... (10,000th value) [3, .15, 2, 1, 2.5]],
          ...
          10,000th key: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ... (10,000th value) [3, .15, 2, 1, 2.5]]}

(Note: Data is dummy, so I have just repeated it across the keys)

The logical data types that I want in the small elementary lists are

inner_list = [int, float, small_int, boolean (0 or 1), float]

A sys.getsizeof(inner_list) shows its size to be 56 bytes. Adding 12 bytes for the int key makes it 68 bytes. Since I have 10^8 such lists (10,000 * 10,000), storing them in memory is becoming a big problem. I want the data in memory (no DB for now). What would be the most optimized way of storing it? I am inclined to think it must have something to do with numpy, but I am not sure what the best method would be or how to implement it. Any suggestions?

2) Also, since I am storing these dictionaries in memory, I would like to free the memory they occupy as soon as I am done using them. Is there a way of doing this in Python?

R.Bahl asked Dec 02 '25

1 Answer

One idea is to break the dictionary structure up into simpler structures, but this may affect how efficiently you can process the data.

1 Create a separate array for the keys

from array import array
keys = array('i', [key1, key2, ..., key10000])

Depending on the range of the key values, you can choose a narrower int typecode for the array. The keys should also be kept sorted, so that you can perform binary search on the key table. This way you also save the space that the hash table in Python's dictionary implementation would use. The downside is that key lookup now takes O(log n) time instead of O(1).
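For illustration, a minimal sketch of such a lookup using the standard bisect module (the sample key values are made up):

from array import array
from bisect import bisect_left

def lookup(keys, key):
    # Binary search a sorted array of keys; returns the position of
    # `key`, usable as a row index, or raises KeyError if absent.
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return i
    raise KeyError(key)

keys = array('i', [3, 17, 42, 99])  # must be kept sorted
print(lookup(keys, 42))             # -> 2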

2 Store the inner_list elements in a 10,000 x 10,000 matrix or in a 100,000,000-element list

As each position i from 0 to 9999 corresponds to a specific key that can be obtained from the keys array, each list of lists can be put into the i'th row of the matrix, with the inner_list elements in the columns of that row.

Another option is to put them in one long list and index it using the key position i, such that

idx = i*10000 + j

where i is the index of the key in the keys array and j is the index of the particular inner_list instance.

Additionally, for the inner_list elements you can keep a total of five separate arrays, one per field, which somewhat breaks the locality of the data in memory:

int_array       = array('i', [value1, ..., value100000000])
float1_array    = array('f', [value1, ..., value100000000])
small_int_array = array('h', [value1, ..., value100000000])
bool_array      = array('b', [value1, ..., value100000000])  # 0/1 flags; '?' is not a valid array typecode
float2_array    = array('f', [value1, ..., value100000000])
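To make the layout concrete, here is a small self-contained sketch of the columnar storage combined with the flat idx = i*10000 + j addressing; the tiny 3 x 2 dimensions and the helper names are made up for brevity:

from array import array

N_KEYS, N_ROWS = 3, 2  # tiny stand-ins for 10000 x 10000

# one typed column per inner_list field, each N_KEYS * N_ROWS long
int_col    = array('i', [0]   * (N_KEYS * N_ROWS))
float1_col = array('f', [0.0] * (N_KEYS * N_ROWS))
small_col  = array('h', [0]   * (N_KEYS * N_ROWS))
bool_col   = array('b', [0]   * (N_KEYS * N_ROWS))
float2_col = array('f', [0.0] * (N_KEYS * N_ROWS))

def set_row(i, j, row):
    # store one inner_list at key position i, row j
    idx = i * N_ROWS + j
    int_col[idx], float1_col[idx], small_col[idx], bool_col[idx], float2_col[idx] = row

def get_row(i, j):
    idx = i * N_ROWS + j
    return [int_col[idx], float1_col[idx], small_col[idx], bool_col[idx], float2_col[idx]]

set_row(0, 1, [1, .65, 3, 0, 5.5])
print(get_row(0, 1))  # -> [1, 0.6499..., 3, 0, 5.5] (float32 rounding)

At 4 + 4 + 2 + 1 + 4 = 15 bytes per record, 10^8 records take roughly 1.5 GB, far less than the same data as Python lists of Python objects.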

The boolean array can be further optimized by packing the flags into bits.
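A minimal sketch of such bit packing with a bytearray (the helper names are made up; 10^8 flags would fit in about 12.5 MB this way):

def set_bit(bits, idx, value):
    # set or clear bit idx in a bytearray used as a bit set
    if value:
        bits[idx >> 3] |= 1 << (idx & 7)
    else:
        bits[idx >> 3] &= ~(1 << (idx & 7))

def get_bit(bits, idx):
    return (bits[idx >> 3] >> (idx & 7)) & 1

n = 100  # stand-in for 10**8
bits = bytearray((n + 7) // 8)
set_bit(bits, 42, 1)
print(get_bit(bits, 42))  # -> 1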

Yet another alternative is to pack the inner_list elements into a binary string using the struct module and store them in a single list instead of five different arrays.
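A sketch of that variant; the '=ifh?f' format mirrors the inner_list field types (standard sizes, no alignment padding), giving 15 bytes per record:

import struct

REC = struct.Struct('=ifh?f')  # int, float, small int, bool, float

packed = REC.pack(1, .65, 3, False, 5.5)  # one inner_list as bytes
print(len(packed))                        # -> 15
print(REC.unpack(packed))                 # -> (1, 0.6499..., 3, False, 5.5)

All 10^8 records could then live in one bytearray, addressed by idx * REC.size.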

3 Releasing memory

As soon as a variable goes out of scope, the object it referenced becomes eligible for garbage collection and its memory can be reclaimed. To make this happen sooner, for example inside a function or a loop, you can rebind the name to a dummy value so that the object's reference count drops to zero:

variable = None
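For example (del works equally well; the gc.collect() call is optional and mainly helps when reference cycles are involved):

import gc

big = {i: [[1, .65, 3, 0, 5.5]] for i in range(10**6)}
# ... process big ...
big = None    # or: del big -- drops the last reference
gc.collect()  # optionally force collection of any reference cycles now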

Note

However, these ideas may not be good enough for your particular case. There are other possibilities too, such as loading only part of the data into memory at a time. It depends on how you plan to process it.

Generally, Python takes its own share of memory for the internal handling of pointers and structures. Therefore, yet another alternative is to implement the particular data structure and its handling in a language like Fortran, C or C++, which can be tuned more easily to your particular needs.

Timo answered Dec 03 '25