
Getting good mixing with many input datafiles in TensorFlow

I'm working with TensorFlow, hoping to train a deep CNN to do move prediction for the game of Go. The dataset I created consists of 100,000 binary data files, where each file corresponds to a recorded game and contains roughly 200 training samples (one for each move in the game). I believe it will be very important to get good mixing when using SGD: I'd like my batches to contain samples from different games AND samples from different stages of the games. For example, simply reading one sample from the start of each of 100 files and shuffling isn't good, because those 100 samples will all be the first move of their game.

I have read the tutorial on feeding data from files, but I'm not sure whether the provided libraries do what I need. If I were to code it by hand, I would basically initialize a bunch of file pointers to random locations within each file and then pull samples from random files, incrementing the file pointers accordingly.

So my question is: does TensorFlow provide this sort of functionality, or would it be easier to write my own code for creating batches?

ScoobySnacks asked Dec 14 '15 00:12



1 Answer

Yes. What you want is a combination of two things. (Note that this answer was written for TensorFlow v1, and some of the functionality has been replaced by the new tf.data pipelines; I've updated the answer to point to the v1 compat versions of things, but if you're coming to this answer for new code, please consult tf.data instead.)

First, randomly shuffle the order in which you read your datafiles by using a tf.train.string_input_producer with shuffle=True, which feeds into whatever input method you use (if you can put your examples into tf.Example proto format, that's easy to use with parse_example). To be very clear: you put the list of filenames in the string_input_producer and then read the files with another method, such as read_file.

Second, you need to mix at a finer granularity. You can accomplish this by feeding the input examples into a tf.train.shuffle_batch node with a large capacity and a large value of min_after_dequeue. One particularly nice way is to use a shuffle_batch_join that receives input from multiple files, so that you get a lot of mixing. Set the capacity of the batch queue big enough to mix well without exhausting your RAM. Tens of thousands of examples usually work pretty well.
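
To see why the buffer has to be large, here is a small pure-Python sketch of the idea behind a shuffling queue. It is not TensorFlow's implementation: shuffle_buffer, buffer_size, and games_in_first_batch are hypothetical names, and buffer_size only loosely plays the role of min_after_dequeue. It borrows the figure of about 200 samples per game from the question and assumes a single reader that emits each game's moves in order.

```python
import random


def shuffle_buffer(samples, buffer_size, rng):
    """Yield every sample once, each drawn at random from a bounded buffer."""
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Swap a randomly chosen element to the end and emit it.
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def games_in_first_batch(buffer_size, batch_size=32, num_games=50,
                         moves_per_game=200, seed=0):
    """Count how many distinct games contribute to the first batch."""
    # One reader walking the files in order: all of game 0, then game 1, ...
    stream = ((game, move)
              for game in range(num_games)
              for move in range(moves_per_game))
    mixed = shuffle_buffer(stream, buffer_size, random.Random(seed))
    batch = [next(mixed) for _ in range(batch_size)]
    return len({game for game, _ in batch})


if __name__ == "__main__":
    for size in (100, 5000):
        print(size, games_in_first_batch(size))
```

With a buffer of 100, every sample in the first batch comes from the same game, because one game alone is enough to fill the buffer. A buffer of 5,000 spans about 25 games, so the batch draws from many of them. Reading from several files at once, as shuffle_batch_join does, widens the mix further for the same buffer size.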

Keep in mind that the batch functions add a QueueRunner to the QUEUE_RUNNERS collection, so you need to run tf.train.start_queue_runners() before pulling batches; otherwise the dequeue op will block.
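
Putting the two steps together, a minimal sketch against the tf.compat.v1 queue API might look like the following. The record layout (BOARD_BYTES plus one label byte), the toy files, and the helper names write_toy_files and read_one_batch are assumptions made for illustration; real game files would need their own decoding. The capacity and min_after_dequeue values are deliberately tiny so that the example runs on a handful of toy records.

```python
import os
import tempfile

BOARD_BYTES = 7                  # hypothetical size of one encoded position
RECORD_BYTES = BOARD_BYTES + 1   # position plus one label byte for the move


def write_toy_files(directory, num_files=3, records_per_file=5):
    """Write small fixed-length binary files that stand in for recorded games."""
    paths = []
    for game in range(num_files):
        path = os.path.join(directory, "game_%d.bin" % game)
        with open(path, "wb") as handle:
            for move in range(records_per_file):
                # Toy record: BOARD_BYTES bytes of "position", then the move label.
                handle.write(bytes([game] * BOARD_BYTES + [move]))
        paths.append(path)
    return paths


def read_one_batch(filenames, batch_size=4, num_readers=2):
    """Build the queue-based pipeline and return one [boards, moves] batch."""
    # Imported here so the file-writing helper above works without TensorFlow.
    import tensorflow as tf
    tf1 = tf.compat.v1

    with tf.Graph().as_default():
        # Step 1: shuffle at file granularity.
        filename_queue = tf1.train.string_input_producer(filenames, shuffle=True)

        # Several readers share the filename queue, so samples from different
        # files are interleaved before they reach the shuffling queue.
        examples = []
        for _ in range(num_readers):
            reader = tf1.FixedLengthRecordReader(record_bytes=RECORD_BYTES)
            _, value = reader.read(filename_queue)
            record = tf.reshape(tf.io.decode_raw(value, tf.uint8), [RECORD_BYTES])
            examples.append([record[:BOARD_BYTES], record[BOARD_BYTES]])

        # Step 2: shuffle at sample granularity. Toy sizes only; for real data
        # the answer above suggests tens of thousands of examples.
        boards, moves = tf1.train.shuffle_batch_join(
            examples, batch_size=batch_size, capacity=12, min_after_dequeue=6)

        with tf1.Session() as sess:
            coord = tf1.train.Coordinator()
            # Without this call the dequeue below would block forever.
            threads = tf1.train.start_queue_runners(sess=sess, coord=coord)
            try:
                return sess.run([boards, moves])
            finally:
                coord.request_stop()
                coord.join(threads)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        boards, moves = read_one_batch(write_toy_files(tmp))
        print(boards.shape, moves.shape)
```

Each reader pulls whole files from the shared filename queue, so num_readers controls how many games are interleaved at once, while min_after_dequeue controls how many samples the shuffling queue holds back to draw from.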

dga answered Oct 21 '22 16:10
