I have a huge Python list (16 GB) that I want to convert to a NumPy array, in place. I can't afford this statement:
huge_array = np.array(huge_list).astype(np.float16)
I'm looking for an efficient way to transform huge_list into a NumPy array without making a copy of it. Can anyone suggest an efficient method? If it involves saving the list to disk first and then loading it back as a NumPy array, I'm OK with that. Any help is highly appreciated.
EDIT 1: huge_list is an in-memory Python list created at runtime, so it's already taking 16 GB. I need to convert it to a NumPy float16 array.
np.array(huge_list, dtype=np.float16) will be faster, since it only copies the list once, not twice.
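The difference is easy to see on a small stand-in for huge_list (a sketch; small_list is just an illustrative name):

```python
import numpy as np

small_list = [float(i) for i in range(1000)]  # small stand-in for huge_list

# Two-step version: first builds a float64 array (8 bytes per element),
# then allocates a second float16 copy (2 bytes per element).
two_step = np.array(small_list).astype(np.float16)

# One-step version: parses straight into float16, so the large
# float64 intermediate is never kept around.
one_step = np.array(small_list, dtype=np.float16)
```

Both produce the same float16 result; the one-step form just avoids the extra 8-byte-per-element allocation.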
You probably don't need to worry about making this copy, because the copy is a lot smaller than the original:
>>> x = [float(i) for i in range(10000)]
>>> sys.getsizeof(x)
83112
>>> y = np.array(x, dtype=np.float16)
>>> sys.getsizeof(y)
20096
But that's not even the worst of it: with the Python list, each number in the list takes up memory of its own:
>>> sum(sys.getsizeof(i) for i in x)
240000
So in total (83112 + 240000 bytes vs. 20096 bytes) the NumPy array is roughly 16x smaller!
As I previously mentioned, the easiest approach is to dump the list to a file and then load that file back into a NumPy array.
First we need the size of the huge list:
huge_list_size = len(huge_list)
Next we dump it to disk:

with open('huge_array.txt', 'w') as dumpfile:
    for item in huge_list:
        dumpfile.write(str(item) + "\n")
Ensure we free the list's memory if this all happens in the same session:
del huge_list
Next we define a simple read generator:

def read_file_generator(filename):
    with open(filename) as infile:
        for i, line in enumerate(infile):
            yield i, line
And then we create a NumPy array of zeros, which we fill with the generator we just created:

huge_array = np.zeros(huge_list_size, dtype='float16')
for i, item in read_file_generator('huge_array.txt'):
    huge_array[i] = float(item)  # each line is a string like "1.0\n"
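If the round trip through disk is only there to avoid a large intermediate array, np.fromiter may be worth trying instead: it consumes the list element by element and writes directly into a preallocated float16 buffer (a sketch; the small list here stands in for huge_list, which you would still have to del afterwards to reclaim its memory):

```python
import numpy as np

huge_list = [float(i) for i in range(1000)]  # small stand-in for the 16 GB list

# Passing count lets fromiter preallocate the exact buffer size,
# so no resizing copies happen while filling it.
huge_array = np.fromiter(huge_list, dtype=np.float16, count=len(huge_list))
```

This avoids both the float64 intermediate and the text-file round trip, though the original list itself still occupies memory until it is deleted.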
My previous answer was incorrect. I suggested the following as a solution, which it is not, as hpaulj pointed out in the comments:
You can do this in multiple ways; the easiest would be to just dump the array to a file and then load that file as a numpy array:

dumpfile = open('huge_array.txt', 'w')
for item in huge_array:
    print >>dumpfile, item

Then load it as a numpy array:

huge_array = numpy.loadtxt('huge_array.txt')

If you want to perform further computations on this data you can also use the joblib library for memmapping, which is extremely useful in handling large NumPy array computations. Available at https://pypi.python.org/pypi/joblib
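NumPy itself also supports memory mapping via np.load's mmap_mode (a sketch of the same idea joblib builds on; the file path and array here are illustrative):

```python
import numpy as np
import tempfile, os

arr = np.arange(1000, dtype=np.float16)  # stand-in for the real data

path = os.path.join(tempfile.mkdtemp(), 'huge_array.npy')
np.save(path, arr)  # write once in numpy's binary .npy format

# mmap_mode='r' maps the file instead of reading it all into RAM;
# pages are loaded lazily as slices are accessed.
mapped = np.load(path, mmap_mode='r')
chunk = mapped[100:200]  # only this slice is actually touched
```

The binary .npy format also loads far faster than the text round trip through loadtxt.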