
TensorFlow: Feeding data with queue vs with direct feeding with feed_dict

I've been using feed_dict to feed placeholders directly while practicing coding on small problems like MNIST. TensorFlow also supports feeding data using queues and queue runners, and it takes some effort to learn.

Has anybody done a comparison of these two methods and measured the performance? Is it worth spending the time to learn to use queues to feed data?

I guess queues are not only about performance, but also about cleaner code, whatever that means. Maybe the code for one dataset can easily be reused for another dataset (once I convert the data into TFRecords)?

However, this post seems to say that queues can be slower than the feed_dict method. Is that still true now? Why should I use queues if they're slower and harder to code?

Thanks for your input.

asked Jul 17 '16 by Wei Liu

3 Answers

I think the benefit you'll see is highly dependent on your problem. I saw a 3x speedup when I switched from feed_dict to a queue. There were at least two reasons it gave such a dramatic improvement in my case:

  1. The Python code that was generating the vectors to feed was pretty slow and unoptimized. For each training example, there were a lot of intermediate steps (allocating some numpy arrays, making a Pandas dataframe, calling a bunch of functions to compute/transform features). 25% of my total training time was spent generating the feed data.

  2. One of the reasons feed_dict can be slow is that it involves a memcpy of the fed data from Python to the TF runtime. My examples were very large, so I took a big hit on this. (In my case, my examples were sequences, and I was zero-padding them to a large maximum length before feeding them.)

If you think either of these might apply to your problem, it's worth considering using queues.
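
If it helps, here's a rough sketch of what a queue-based TFRecord input pipeline looks like in TF 1.x. The file name, feature names and shapes are illustrative, not from my setup:

import tensorflow as tf

# Rough sketch of a queue-based input pipeline (TF 1.x); the file name,
# feature names and shapes here are placeholders for illustration.
filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={'image': tf.FixedLenFeature([784], tf.float32),
              'label': tf.FixedLenFeature([], tf.int64)})
# Background threads fill this batch queue, so no per-step memcpy of fed data.
images, labels = tf.train.shuffle_batch(
    [features['image'], features['label']],
    batch_size=32, capacity=2000, min_after_dequeue=1000)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... sess.run(train_op) in a loop; data is dequeued inside the graph ...
    coord.request_stop()
    coord.join(threads)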

answered Nov 19 '22 by Coquelicot


My NMT model has 2 layers and 512 hidden units. I train with maximum sentence length = 50 and batch size = 32, and I see similar speed between feed_dict and queues: about 2400-2500 target words per second (I use this metric for speed, based on this paper).

I find feed_dict very intuitive and easy to use. Queues are difficult. Using queues, you have to:

1/ Convert your data into TFRecords. I actually had to google a bit to understand how to convert my seq2seq data to TFRecords because the docs are not very helpful.

2/ Decode your data from TFRecords. You'll find that the functions used to generate TFRecords and the functions used to decode them don't intuitively match. For example, if each of my training examples has 3 sequences (just 3 lists of integers) src_input, trg_input, trg_target, and I also want to record the length of src_input (some of its elements might be PADDINGs, so don't count them), here is how to generate a TFRecord from each example:

def _make_example(src_input, src_seq_length, trg_input, trg_seq_length, trg_target, target_weight):
    # Scalar context feature: the unpadded source length.
    context = tf.train.Features(
        feature={
            'src_seq_length': int64_feature(src_seq_length)
        })
    # Variable-length sequences go into feature lists (one Feature per time step).
    feature_lists = tf.train.FeatureLists(
        feature_list={
            'src_input': int64_featurelist(src_input),
            'trg_input': int64_featurelist(trg_input),
            'trg_target': int64_featurelist(trg_target)
        })

    return tf.train.SequenceExample(context=context, feature_lists=feature_lists)
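
Each SequenceExample then has to be serialized and written to a file. A minimal sketch (the file name and the `examples` iterable are assumptions, not part of my actual code):

writer = tf.python_io.TFRecordWriter('train.tfrecords')
for ex in examples:  # each ex holds the fields passed to _make_example
    record = _make_example(ex['src_input'], ex['src_seq_length'],
                           ex['trg_input'], ex['trg_seq_length'],
                           ex['trg_target'], ex['target_weight'])
    writer.write(record.SerializeToString())
writer.close()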

And here's how to decode it:

def _read_and_decode(filename_queue, tfrecord_option=None):
    # tfrecord_option: an optional tf.python_io.TFRecordOptions (e.g. for compressed files)
    reader = tf.TFRecordReader(options=tfrecord_option)
    _, serialized_ex = reader.read(filename_queue)

    context_features = {
        'src_seq_length': tf.FixedLenFeature([], dtype=tf.int64)
    }
    sequence_features = {
        'src_input': tf.FixedLenSequenceFeature([], dtype=tf.int64),
        'trg_input': tf.FixedLenSequenceFeature([], dtype=tf.int64),
        'trg_target': tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }
    context, sequences = tf.parse_single_sequence_example(
        serialized_ex, 
        context_features=context_features, 
        sequence_features=sequence_features)

    src_seq_length = tf.cast(context['src_seq_length'], tf.int32)
    src_input = tf.cast(sequences['src_input'], tf.int32)
    trg_input = tf.cast(sequences['trg_input'], tf.int32)
    trg_target = tf.cast(sequences['trg_target'], tf.int32)

    return src_input, src_seq_length, trg_input, trg_target

And here are the helpers that generate each TFRecord feature/feature list:

def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def int64_featurelist(l):
    feature = [tf.train.Feature(int64_list=tf.train.Int64List(value=[x])) for x in l]
    return tf.train.FeatureList(feature=feature)  
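
For completeness, here's a rough sketch of how the decoded tensors might then be batched with dynamic padding. The filename queue, batch size and capacity below are assumptions, not my exact settings:

# Assumes a filename queue feeding the decoder above.
filename_queue = tf.train.string_input_producer(['train.tfrecords'])
src_input, src_seq_length, trg_input, trg_target = _read_and_decode(filename_queue)

# dynamic_pad=True pads each batch to the longest sequence in that batch.
src_batch, src_len_batch, trg_in_batch, trg_tgt_batch = tf.train.batch(
    [src_input, src_seq_length, trg_input, trg_target],
    batch_size=32, capacity=1000, dynamic_pad=True)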

http://gph.is/2cg7iKP

3/ Train/dev setup. I believe it's common practice to train your model for some time, then evaluate on the dev set, then repeat. I don't know how to do this with queues. With feed_dict, you just build two graphs with shared parameters under the same session, one for train and one for dev. When you evaluate on the dev set, just feed the dev data into the dev graph, and that's it. But with queues, the output from the queue is part of the graph itself. To run the queue, you have to start the queue runner, create a coordinator, and use this coordinator to manage the queue. When it's done, the queue is closed! Currently, I have no idea how to best write my code to conform to the train/dev setup with queues, except opening a new session and building a new graph for dev each time I evaluate. The same issue was raised here, and you can google for similar questions on Stack Overflow.
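
One possible workaround, sketched here assuming the batched tensors from the sketch above (I haven't verified it's the best approach), is to wrap the queue outputs in tf.placeholder_with_default, so training dequeues by default while evaluation feeds dev data into the same tensors:

# Hedged sketch: the queue output is the default value, but dev data can be fed in.
src_input = tf.placeholder_with_default(src_batch, shape=[None, None])
src_seq_length = tf.placeholder_with_default(src_len_batch, shape=[None])
# ... build the model on src_input / src_seq_length as usual ...

# Training step: tensors come from the queue.
#   sess.run(train_op)
# Dev step: same graph, but with dev data fed in.
#   sess.run(dev_loss, feed_dict={src_input: dev_src, src_seq_length: dev_len})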

However, a lot of people have said that queues are faster than feed_dict. My guess is that queues are beneficial if you train in a distributed manner. But for me, I often train on only 1 GPU, and so far I'm not impressed with queues at all. Well, just my guess.

answered Nov 19 '22 by tnq177


Here is one benchmark:

BasicRNNCell unrolled to 20 time steps with 200 hidden units. I had 250k training examples and ran 1 epoch with a batch size of 20.

feed_dict: 597 seconds
queue: 591 seconds

This was with TF v1.0, on an i5 laptop (so 4 CPUs) running Ubuntu 16.04.
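
For what it's worth, here is roughly how such a wall-clock comparison can be run (the loop bodies and names below are illustrative, not the exact benchmark code):

import time

# feed_dict variant: data is generated in Python and fed each step.
start = time.time()
for batch_x, batch_y in batches:
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
print('feed_dict epoch: %.0f s' % (time.time() - start))

# queue variant: the input tensors are part of the graph, so nothing is fed.
start = time.time()
for _ in range(num_batches):
    sess.run(train_op)
print('queue epoch: %.0f s' % (time.time() - start))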

answered Nov 19 '22 by Patrick Coady