I am new to Python. I have a big array, `a`, with dimensions (43200, 4000), and I need to save it for future processing. When I try to save it with np.savetxt, the txt file is too large and my program runs into a memory error, since I need to process 5 files of the same size. Is there any way to save huge arrays so that they take up less memory?
Thanks.
Saving your data to a text file is hugely inefficient. NumPy has built-in saving commands save and savez/savez_compressed, which are much better suited to storing large arrays.
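For example, a minimal sketch (the file names here are just placeholders):
import numpy as np

a = np.ones((43200, 4000))  # stands in for your large array

# Binary .npy file: far smaller and faster to read/write than savetxt output
np.save("a.npy", a)

# Compressed .npz archive; useful when the data compresses well
np.savez_compressed("a_compressed.npz", a=a)

# Later, load them back
a_loaded = np.load("a.npy")
a_from_npz = np.load("a_compressed.npz")["a"]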
Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or PyTables), which allows you to store large data sets without having to load them all into memory.
You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.
Here is another StackOverflow question that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
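With h5py, a rough sketch might look like this (the file and dataset names are assumptions):
import h5py
import numpy as np

a = np.ones((43200, 4000))  # stands in for your large array

# Write the array to an HDF5 file; gzip compression is optional
with h5py.File("some_file.h5", "w") as f:
    f.create_dataset("a", data=a, compression="gzip")

# Later: open the file and read only a slice, so the full array
# never has to sit in memory at once
with h5py.File("some_file.h5", "r") as f:
    first_rows = f["a"][:100, :]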
If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:
import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()
# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
# `usable_a_copy` affect the saved data.
copy_as_nparray = usable_a_copy.values
With files of this size, you might also consider whether your application can use a parallel algorithm, potentially applied to only subsets of the large arrays rather than consuming the entire array before proceeding.
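Since append() writes the data in PyTables' appendable table format, one way to do that (a sketch; the chunk size of 5000 rows is arbitrary) is to read and process the stored DataFrame in row chunks:
import pandas

with pandas.HDFStore("some_file.hdf") as store:
    n_rows = store.get_storer("a").nrows      # total rows in the stored table
    for start in range(0, n_rows, 5000):
        chunk = store.select("a", start=start, stop=start + 5000)
        # ... process `chunk` here ...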