
Reducing the memory size of a list for multiprocessing.Pool.starmap()

I have built a list of arguments for multiprocessing (specifically, for multiprocessing.Pool().starmap()) and want to reduce its memory size. The list is built as follows:

import sys
import numpy as np
from itertools import product

lst1 = np.arange(1000)
lst3 = np.arange(0.05, 4, 0.05)

lst1_1 = list(product(enumerate(lst3),
                      (item for item in product(lst1, lst1) if item[0] < item[1])
                      ))

Its memory size, as reported by sys.getsizeof(lst1_1), is 317840928 bytes.


Since the dtype of lst1 is int32, I thought that changing it to int16 would halve the memory size of lst1 and, consequently, of lst1_1, because int16 data takes up half the memory of int32 data. So I did the following:

lst2 = np.arange(1000, dtype = np.int16)
lst2_1 = list(product(enumerate(lst3),
                      (item for item in product(lst2, lst2) if item[0] < item[1])
                      ))

Surprisingly, the memory size of lst2_1 calculated by sys.getsizeof(lst2_1) is still 317840928.


My questions are the following:

1) Is the memory size of the list independent of the datatype of the source data?

2) If so, then what's the best way to reduce the memory size of the list without converting to a generator?

Note that converting to a generator won't help: when a generator is passed to multiprocessing.Pool().starmap(), it gets converted back to a list anyway.

asked Jul 23 '19 00:07 by mathguy


1 Answer

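You are converting the arrays to Python lists before you check their size. Once that happens, every integer becomes a separate Python object, and the list only stores references to those objects. sys.getsizeof() on a list reports the size of the list object itself (its array of references), not the size of the objects it refers to. So yes, the reported size depends on the number of elements and not on the dtype of the source data. Here is an example of that behavior using your arrays: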

import sys
import numpy as np

lst1 = np.arange(1000)
lst2 = np.arange(1000, dtype = np.int16)

print(sys.getsizeof(lst1)) # 4096
print(sys.getsizeof(lst2)) # 2096
print(sys.getsizeof(list(lst1))) # 9112
print(sys.getsizeof(list(lst2))) # 9112
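
To make the point about references concrete, here is a minimal sketch. The helper name deep_size is made up for this illustration, and it only looks one level deep, so treat it as a rough estimate rather than an exact memory measurement. The absolute numbers depend on the platform and Python version, which is why none are shown.

```python
import struct
import sys

import numpy as np

n = 1000
as_int32 = list(np.arange(n, dtype=np.int32))
as_int16 = list(np.arange(n, dtype=np.int16))

# A list stores one reference (pointer) per element, whatever the element type is.
pointer_size = struct.calcsize("P")  # typically 8 on a 64-bit build
print(pointer_size)

# Same length, same list size: the element dtype does not show up here.
print(sys.getsizeof(as_int32) == sys.getsizeof(as_int16))


def deep_size(lst):
    """Hypothetical helper: list size plus the size of each element it references."""
    return sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)


# Counting the referenced objects as well gives a much larger total.
print(deep_size(as_int32), deep_size(as_int16))
```

The per-element objects are where most of the memory goes, and sys.getsizeof() on the list alone never sees them.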

NumPy is a C-based library, so you can choose which integer type to use (much like choosing between int, long, and long long in C). To keep that advantage, the data has to stay in a NumPy array instead of being converted to Python objects. That is also why NumPy provides so many functions of its own: they keep both the operations and the data at that lower level.
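
As a rough sketch of what "staying in NumPy" could look like for the list in the question: the (i, j) pairs with i < j can be kept in a single int16 array, and one task per element of lst3 can be built instead of one task per (lst3 element, pair) combination. The worker function work and its computation are placeholders invented for this example, and itertools.starmap is used as a serial stand-in that takes the same argument tuples as multiprocessing.Pool().starmap().

```python
from itertools import starmap

import numpy as np

n = 1000
lst3 = np.arange(0.05, 4, 0.05)

# All (i, j) with i < j, stored as one compact array rather than Python tuples.
# Values go up to 999, so they fit in int16.
i, j = np.triu_indices(n, k=1)
pairs = np.stack([i, j], axis=1).astype(np.int16)

print(pairs.shape)   # (499500, 2)
print(pairs.nbytes)  # 1998000


def work(k, value, pairs):
    """Hypothetical worker: handles every pair for one element of lst3."""
    # Placeholder computation, vectorised over all pairs at once.
    total = (pairs[:, 1] - pairs[:, 0]).sum(dtype=np.int64)
    return k, float(value * total)


# One task per element of lst3; every tuple references the same pairs array.
tasks = [(k, value, pairs) for k, value in enumerate(lst3)]

# Serial stand-in; Pool().starmap(work, tasks) accepts the same task list.
results = list(starmap(work, tasks))
print(len(results) == len(lst3))
```

One caveat with this layout: with a real Pool, the pairs array is pickled and sent along with each task. That is about 2 MB per task here, which may be acceptable; handing the array to the workers once through the Pool's initializer and initargs arguments is another option that is not shown.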

answered Oct 20 '22 07:10 by Rockybilly