 

Tensorflow Dataset.from_tensor_slices taking too long

I have the following code:

import numpy as np
import tensorflow as tf

data = np.load("data.npy")
print(data)  # Makes sure the array gets loaded in memory
dataset = tf.contrib.data.Dataset.from_tensor_slices(data)

The file "data.npy" is 3.3 GB. Reading the file with numpy takes a couple of seconds but then the next line that creates the tensorflow dataset object takes ages to execute. Why is that? What is it doing under the hood?

asked Oct 20 '17 by niko

1 Answer

Quoting this answer:

np.load of an npz just returns a file loader, not the actual data. It's a 'lazy loader', loading the particular array only when accessed.

That is why it is fast.
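Note that the quoted behaviour applies to .npz archives: np.load on an .npz returns an NpzFile object that reads each member array only when it is accessed, while a plain .npy file like the one in the question is read eagerly unless mmap_mode is passed. A minimal sketch of the difference (the file names here are just placeholders):

import numpy as np

# A plain .npy file is read into memory immediately...
arr = np.load("data.npy")

# ...unless memory mapping is requested, which defers the reads
arr_mapped = np.load("data.npy", mmap_mode="r")

# An .npz archive returns an NpzFile, a lazy loader; each member
# array is only read from disk when accessed by key
archive = np.load("archive.npz")
member = archive["arr_0"]  # the actual read happens here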

Edit 1: to expand on this answer a bit more, here is another quote from TensorFlow's documentation:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

This works well for a small dataset, but wastes memory (because the contents of the array will be copied multiple times) and can run into the 2GB limit for the tf.GraphDef protocol buffer.

The link also shows how to do this efficiently.
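For reference, the more efficient pattern from that guide is to define the dataset in terms of a tf.placeholder and feed the array once when the iterator is initialized, so the 3.3 GB array is never serialized into the graph as a constant. A minimal TF 1.x sketch, matching the tf.contrib.data era of the question:

import numpy as np
import tensorflow as tf

data = np.load("data.npy")

# Build the pipeline around a placeholder so the array is not
# embedded in the graph as a tf.constant
data_placeholder = tf.placeholder(data.dtype, data.shape)
dataset = tf.data.Dataset.from_tensor_slices(data_placeholder)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # The array is fed exactly once, at iterator initialization
    sess.run(iterator.initializer, feed_dict={data_placeholder: data})
    first_slice = sess.run(next_element)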

answered Oct 10 '22 by Julio Daniel Reyes