I have the following code:
import numpy as np
import tensorflow as tf

data = np.load("data.npy")
print(data)  # makes sure the array gets loaded into memory
dataset = tf.contrib.data.Dataset.from_tensor_slices(data)
The file "data.npy"
is 3.3 GB. Reading the file with numpy takes a couple of seconds but then the next line that creates the tensorflow dataset object takes ages to execute. Why is that? What is it doing under the hood?
Quoting this answer:

np.load of an npz just returns a file loader, not the actual data. It's a 'lazy loader', loading the particular array only when accessed.

That is why it is fast.
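To make that lazy behaviour concrete, here is a minimal sketch (the file name "archive.npz" and the array key are made up for this example):

import numpy as np

arr = np.arange(10)
np.savez("archive.npz", big=arr)      # save the array into an .npz archive

loader = np.load("archive.npz")       # fast: returns an NpzFile, not the arrays themselves
print(type(loader))                   # <class 'numpy.lib.npyio.NpzFile'>
big = loader["big"]                   # the array is only read from disk here, on access
loader.close()

# A plain .npy file, by contrast, is read eagerly by np.load unless you pass
# mmap_mode="r" to memory-map it instead of loading it.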
Edit 1: to expand on this answer a bit more, here is another quote from TensorFlow's documentation:
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices(). This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
The linked documentation also shows how to do this more efficiently.
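In short, the approach that guide describes is to define the dataset from a tf.placeholder and feed the NumPy array once when the iterator is initialized, so the array is not embedded in the graph as a constant. Below is a rough sketch of that pattern, assuming the TF 1.x graph API and the same "data.npy" from the question (on versions before 1.4 the Dataset class lives under tf.contrib.data instead of tf.data):

import numpy as np
import tensorflow as tf

data = np.load("data.npy")  # still a fast, plain NumPy load

# A placeholder with the array's dtype/shape stands in for the data, so the
# 3.3 GB array is not copied into the graph as a constant.
data_placeholder = tf.placeholder(data.dtype, data.shape)
dataset = tf.data.Dataset.from_tensor_slices(data_placeholder)

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # The array is fed exactly once here, when the iterator is initialized.
    sess.run(iterator.initializer, feed_dict={data_placeholder: data})
    first_slice = sess.run(next_element)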