 

How does the `my_input_fn` in the getting started with TensorFlow allow enumeration over the data?

I'm looking at First Steps with TensorFlow as part of the Google machine learning crash course and I'm already confused. My understanding is (please correct me if I'm wrong):

  • Step 4 defines an input function my_input_fn that formats the data into the relevant TensorFlow structures.
  • Step 5 then supplies this function to the train call.
  • The intention is that the train call will make successive calls to my_input_fn to get successive batches of data to adjust the model on. (??? very suspect on this now)

my_input_fn is defined here:

import numpy as np
from tensorflow.python.data import Dataset

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Convert pandas data into a dict of np arrays.
    features = {key:np.array(value) for key,value in dict(features).items()}                                           

    # Construct a dataset, and configure batching/repeating
    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified
    if shuffle:
      ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels

From my reading of my_input_fn, I don't understand how this happens. I have only a rudimentary knowledge of Python, but my reading of the function is that every call to it will re-initialise the tensor structures from the pandas frames, get an iterator, and then return the first element of it, every time it is called. Sure, in the case of this example, if the data is shuffled (which it is by default) and the dataset is big, it's unlikely you'll get duplicates for a step of 100, but this smells of sloppy programming (i.e. in the case where it isn't shuffled it would always return the same first batch of training data), so I doubt this is the case.

My next suspicion is that the one_shot_iterator().get_next() call is doing some interesting/wacky/tricky stuff. Like returning some sort of lazily evaluated structure that will allow the train function to enumerate to the next batch by itself, as opposed to re-invoking my_input_fn?

But honestly I'd like to clarify this, because at this stage (more hours later than I care to think about) I am not any closer to understanding.

My attempts to research have just led to further confusion.

The tutorial suggests reading this - at one point it says "The train, evaluate, and predict methods of every Estimator require input functions to return a (features, label) pair containing tensorflow tensors.". Okay, this is in line with my original thoughts: basically the example and label packaged in TensorFlow structures.

But then it shows the results of what it returns, and it is stuff like this (example):

({
    'SepalLength': <tf.Tensor 'IteratorGetNext:2' shape=(?,) dtype=float64>,
    'PetalWidth': <tf.Tensor 'IteratorGetNext:1' shape=(?,) dtype=float64>,
    'PetalLength': <tf.Tensor 'IteratorGetNext:0' shape=(?,) dtype=float64>,
    'SepalWidth': <tf.Tensor 'IteratorGetNext:3' shape=(?,) dtype=float64>},
Tensor("IteratorGetNext_1:4", shape=(?,), dtype=int64))

In the code lab, my_input_fn(my_feature, targets) returns:

({'total_rooms': <tf.Tensor 'IteratorGetNext:0' shape=(?,) dtype=float64>},

)

I have NO IDEA what to make of this. Nothing I have read about tensors mentions anything like this. I don't even know how to BEGIN interrogating this with my rudimentary Python and non-existent TensorFlow knowledge.

The documentation for the one-shot iterator says it creates an iterator for enumerating the elements. Again, this is in line with my thinking.

The get_next documentation says:

Returns a nested structure of tf.Tensors containing the next element.

I don't know how to parse this. What sort of nested structure? I mean, it looks like a tuple, but why wouldn't you just say tuple? What dictates this? Where is it described? Surely it is important?

What am I misunderstanding here?

(For a course that purportedly requires no prior knowledge of TensorFlow, the Google machine learning crash course is making me feel pretty moronic. I'm genuinely curious as to how others in my situation are getting on with it.)

asked Mar 20 '18 by fostandy


1 Answer

The input function (in this case my_input_fn) is not called repeatedly. It is called once, creates a bunch of tensorflow ops (for creating a dataset, shuffling it, etc.) and finally returns the get_next op of the iterator. This op will be called repeatedly, but all it does is iterate over the dataset. The things you do in my_input_fn (such as shuffling, batching, repeating) only happen once.
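
To see this concretely, here's a minimal sketch in TF 1.x graph mode (with made-up toy data standing in for the crash course's pandas DataFrames; toy_input_fn is a hypothetical stand-in for my_input_fn):

import numpy as np
import tensorflow as tf

# Toy stand-ins for the notebook's pandas data (hypothetical values).
features = {'total_rooms': np.arange(10, dtype=np.float64)}
targets = np.arange(10, dtype=np.float64)

def toy_input_fn():
    ds = tf.data.Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(4).repeat(None)
    # This function body runs exactly once: it only builds graph ops.
    return ds.make_one_shot_iterator().get_next()

next_batch = toy_input_fn()  # called a single time

with tf.Session() as sess:
    for _ in range(3):
        # Each sess.run advances the iterator to the next batch;
        # toy_input_fn itself is never re-invoked.
        print(sess.run(next_batch))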

In general: When working with Tensorflow programs, you have to get used to the fact that they work quite differently from "normal" Python programs. Most of the code you write (especially things with tf. in front) will only be executed once to build the computation graph, and then this graph is executed many times.
EDIT: However, there is the experimental tf.eager API (supposedly becoming fully integrated in TF 1.7) that changes exactly this, i.e. things are executed as you write them (more like numpy). This should allow for faster experimentation.
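
A tiny illustration of that build-then-run split:

import tensorflow as tf

c = tf.constant(2) + tf.constant(3)
print(c)   # a Tensor like Tensor("add:0", shape=(), dtype=int32), not 5

with tf.Session() as sess:
    print(sess.run(c))   # 5 -- only now is the graph actually executed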

To go through the input function step by step: You start out with a dataset that you create from "tensor slices" (e.g. numpy arrays). Then you call the batch method. This essentially creates a new dataset, the elements of which are batches of elements of the original dataset. Similarly, repeating and shuffling also create new datasets (to be precise, they create ops that will create these datasets once they're actually executed as part of the computation graph). Finally, you return an iterator over the batched, repeated, shuffled dataset. Only this iterator's get_next op will execute repeatedly, returning new elements of the dataset until it is exhausted.
EDIT: Indeed iterator.get_next() only returns an op. The iteration is performed only once this op is run in a tf.Session.
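
For example (a sketch; Dataset.range is just a convenient toy source):

import tensorflow as tf

ds = tf.data.Dataset.range(10)   # elements 0..9
ds = ds.batch(3).repeat(2)       # each call returns a NEW dataset object

nxt = ds.make_one_shot_iterator().get_next()   # one op, built once

with tf.Session() as sess:
    print(sess.run(nxt))   # [0 1 2]
    print(sess.run(nxt))   # [3 4 5]  -- same op, next element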

As for the output that you have "no idea what to make of": Not sure what your question is exactly, but what you posted are just dicts mapping strings to tensors. The tensors automatically get names related to the op that produces them (iterator.get_next), and their shape is not known because the batch size can be variable -- even if you specify it, the last batch could be smaller if the batch size doesn't evenly divide the dataset size (e.g. a dataset with 10 elements and a batch size of 4 -- the last batch is going to be size 2). The ? elements in tensor shapes signify unknown dimensions.
EDIT: Regarding the naming: The ops receive default names. However they would all receive the same default name (IteratorGetNext) in this case, but there cannot be multiple ops with the same name. So Tensorflow automatically appends integers to make the names unique. That's all!
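
Both points are easy to verify in a toy TF 1.x session (a sketch; the two iterators exist only to show the name-uniquing):

import tensorflow as tf

ds = tf.data.Dataset.range(10).batch(4)
a = ds.make_one_shot_iterator().get_next()
b = ds.make_one_shot_iterator().get_next()

print(a.shape)   # (?,) -- batch dimension unknown at graph-build time
print(a.name)    # IteratorGetNext:0
print(b.name)    # IteratorGetNext_1:0 -- suffix appended for uniqueness

with tf.Session() as sess:
    print(sess.run(a))   # [0 1 2 3]
    print(sess.run(a))   # [4 5 6 7]
    print(sess.run(a))   # [8 9]  -- last batch smaller: 10 % 4 == 2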

As for "nested structures": Input functions are often used with tf.estimator which expects a fairly simple input structure (a tuple containing a Tensor or dict of Tensors as input, and a Tensor as output if I'm not mistaken). However in general, input functions support more complex, nested output structures such as (a, (tuple, of), (tuples, (more, tuples, elements), and), words). Note that this is the structure of one output, i.e. one "step" of the iterator (e.g. a batch of data). Repeatedly calling this op will enumerate the whole dataset.
EDIT: What structure is returned by an input function is determined by just that function! E.g. a dataset from tensor slices will return tuples, where the nth element is the nth "tensor slice". There are functions such as dataset.zip that work just like the Python equivalent. If you took a dataset with structure (e1, e2) and zipped it with a dataset (e3,), you would get ((e1, e2), e3).
What format is needed depends on the application. In principle you could provide any format and then the code that receives this input could do anything with it. However, as I said, probably the most common use is in the context of tf.estimator, and there your input function is supposed to return a tuple (features, labels) where features is either a tensor or dict of tensors (as in your case) and labels is also a tensor or dict of tensors. If either is a dict, the model function is responsible for grabbing the correct values/tensors from there.
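
A small sketch of how the returned structure just mirrors how the dataset was built (zipping two toy datasets):

import tensorflow as tf

ds1 = tf.data.Dataset.from_tensor_slices(([1, 2], [10, 20]))  # elements: (e1, e2)
ds2 = tf.data.Dataset.from_tensor_slices([100, 200])          # elements: e3
zipped = tf.data.Dataset.zip((ds1, ds2))                      # elements: ((e1, e2), e3)

el = zipped.make_one_shot_iterator().get_next()
print(el)   # nested tuple of tensors mirroring the dataset structure

with tf.Session() as sess:
    print(sess.run(el))   # roughly ((1, 10), 100)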

In general, I would advise you to play around with this stuff. Check out the tf.data API and of course the Programmer's Guide. Create some datasets/input functions and simply start a session and repeatedly run the iterator.get_next() op. See what comes out of there. Try all the different transformations such as zip, take, padded_batch... Seeing it in action without the need to actually do anything with this data should give you a better understanding.
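
For instance, something like this sketch (note how the iterator signals exhaustion with tf.errors.OutOfRangeError):

import tensorflow as tf

ds = tf.data.Dataset.range(10).take(7).batch(3)
nxt = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    while True:
        try:
            print(sess.run(nxt))   # [0 1 2], then [3 4 5], then [6]
        except tf.errors.OutOfRangeError:
            break                  # dataset exhausted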

answered Oct 11 '22 by xdurch0