The file contains 2000000 rows; each row contains 208 columns separated by commas, like this:
0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
The program reads this file into a NumPy ndarray, and I expected it to consume about (2000000 * 208 * 8 B) = 3.2 GB of memory.
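For reference, a quick sketch of where that estimate comes from, assuming every value is stored as a float64:

import numpy as np

# Each value is a float64 (8 bytes), so the final array should need
# rows * cols * itemsize bytes.
expected = 2000000 * 208 * np.dtype(np.float64).itemsize
print(expected)  # 3328000000 bytes, i.e. roughly 3 GB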
However, when the program actually reads the file, it consumes about 20 GB of memory.
Why does my program use so much more memory than I expected?
I'm using NumPy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that both of them build temporary Python lists to store the data while parsing.
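To see why that matters at this scale, here is a simplified sketch of list-based parsing; this is only an illustration of the pattern, not NumPy's actual code, and 'data.txt' is a placeholder file name:

import numpy as np

# Every parsed value becomes a separate Python float object (about 24 bytes
# each on 64-bit CPython) plus an 8-byte list slot, so holding all
# 2000000 * 208 values in nested lists costs several times more memory
# than the final 3.2 GB float64 array, which is only built at the end.
rows = []
with open('data.txt') as f:
    for line in f:
        rows.append([float(x) for x in line.split(',')])
arr = np.array(rows)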
If you know the shape of your array beforehand, you can write a file reader that consumes an amount of memory very close to the theoretical amount (about 3.2 GB in this case), by parsing the data directly into a preallocated array of the corresponding dtype:
import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        # First pass: count the rows.
        nrows = sum(1 for line in f)
        f.seek(0)
        # Infer the number of columns from the first line.
        ncols = len(next(f).split(delimiter))
        # Preallocate the output array with the final shape and dtype.
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        # Second pass: parse each line straight into the preallocated array.
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out
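For example, for the file described in the question (the file name here is just a placeholder):

data = read_large_txt('data.txt', delimiter=',', dtype=np.float64)
print(data.shape)   # (2000000, 208)
print(data.nbytes)  # 3328000000 bytes, close to the expected figure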