How to save big array so that it will take less memory in python?

Tags:

python

numpy

I am new to Python. I have a big array, a, with dimensions (43200, 4000), and I need to save it for future processing. When I try to save it with np.savetxt, the text file is huge and my program runs into a memory error, as I need to process 5 files of the same size. Is there any way to save huge arrays so that they take less memory?

Thanks.

asked Sep 10 '13 by user2766019



2 Answers

Saving your data to a text file is hugely inefficient. NumPy has built-in binary saving commands, np.save and np.savez / np.savez_compressed, which are much better suited to storing large arrays.
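As a quick sketch (filenames and the small stand-in array are my own choices, not from the question), saving and reloading in NumPy's binary formats looks like this:

```python
import numpy as np

a = np.random.rand(100, 50)  # stand-in for the (43200, 4000) array

# .npy: binary file that preserves dtype and shape exactly,
# far smaller and faster than savetxt output
np.save("a.npy", a)

# .npz: compressed archive that can hold several arrays under named keys
np.savez_compressed("arrays.npz", a=a)

# Loading them back
b = np.load("a.npy")
with np.load("arrays.npz") as archive:
    c = archive["a"]
```

For float data that doesn't compress well, plain np.save is often the fastest option; np.savez_compressed pays off when the data has repetition (e.g. many zeros).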

Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or PyTables), which lets you store large data sets without loading them entirely into memory.
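A minimal h5py sketch of this idea (the filename, dataset name, and gzip compression level are assumptions for illustration): the key point is that you can later slice the dataset straight from disk.

```python
import numpy as np
import h5py

a = np.random.rand(100, 50)  # stand-in for one of the large arrays

# Write the array as a compressed HDF5 dataset
with h5py.File("data.h5", "w") as f:
    f.create_dataset("a", data=a, compression="gzip")

# Later: read only the rows you need, without loading the whole array
with h5py.File("data.h5", "r") as f:
    first_rows = f["a"][:10]
```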

answered Sep 18 '22 by hackyday


You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.

Here is another StackOverflow question that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
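A rough sketch of the direct PyTables route (the filename and zlib compression settings are my own assumptions): a compressed CArray is stored in chunks, so slices can be read back without loading the whole array.

```python
import numpy as np
import tables

a = np.random.rand(100, 50)  # stand-in for the (43200, 4000) array

# Store the array as a compressed, chunked CArray
with tables.open_file("pt_file.h5", "w") as f:
    filters = tables.Filters(complevel=5, complib="zlib")
    f.create_carray(f.root, "a", obj=a, filters=filters)

# Read back just a slice from disk
with tables.open_file("pt_file.h5", "r") as f:
    rows = f.root.a[:10]
```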

If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:

import pandas
import numpy as np
a = np.ones((43200, 4000)) # Example only: builds the full array in memory.
x = pandas.HDFStore("some_file.hdf")

x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()

# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # This is an in-memory copy; changes to it
                             # do not propagate back to the file on disk.

copy_as_nparray = usable_a_copy.values

With files of this size, you might also consider whether your processing can be expressed as a parallel algorithm, or applied to subsets of the large arrays, rather than needing the entire array in memory before proceeding.
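One simple way to process subsets is to memory-map a saved .npy file and walk it in chunks; this sketch (chunk size and the column-sum computation are illustrative assumptions) never holds more than one chunk in RAM beyond the OS page cache:

```python
import numpy as np

a = np.random.rand(100, 50)  # stand-in for the large array
np.save("big.npy", a)

# Memory-map the file: rows are read from disk only as they are touched
m = np.load("big.npy", mmap_mode="r")

# Accumulate column sums chunk by chunk instead of loading everything
chunk = 25
col_sums = np.zeros(m.shape[1])
for start in range(0, m.shape[0], chunk):
    col_sums += m[start:start + chunk].sum(axis=0)
```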

answered Sep 21 '22 by ely