 

How do you load an LMDB file into TensorFlow?

I have a large (1 TB) set of data split over about 3,000 CSV files. My plan is to convert it to one large LMDB file so it can be read quickly for training a neural network. However, I have not been able to find any documentation on how to load an LMDB file into TensorFlow. Does anyone know how to do this? I know TensorFlow can read CSV files, but I believe that would be too slow.

asked May 20 '16 by user1389840

1 Answer

According to the TensorFlow documentation on reading data, there are several ways to feed data into a TensorFlow graph.

The simplest is to feed your data through placeholders; in that case, the responsibility for shuffling and batching is entirely on you.
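What that responsibility looks like in practice: a minimal, framework-free sketch (plain Python, no TensorFlow) of the shuffle-and-slice loop you would run yourself, feeding each batch into the placeholders via `feed_dict`:

```python
import random

def shuffled_batches(examples, labels, batch_size, seed=None):
    """Yield (examples, labels) batches in a fresh random order.

    This is the bookkeeping TensorFlow will NOT do for you when you feed
    data through placeholders -- you shuffle and slice yourself, then pass
    each batch via feed_dict={x_ph: batch_x, y_ph: batch_y}.
    """
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # Index examples and labels with the SAME permutation so pairs stay aligned.
        yield [examples[i] for i in idx], [labels[i] for i in idx]
```

You would call this once per epoch inside your training loop; the `seed` argument exists only to make the sketch reproducible.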

If you want to delegate shuffling and batching to the framework, you need to create an input pipeline. The question then becomes: how do you inject LMDB data into the symbolic input pipeline? A possible solution is the tf.py_func operation, which wraps a regular Python function as a graph op. Here is an example:

def create_input_pipeline(lmdb_env, keys, num_epochs=10, batch_size=64):
    # A queue that yields the LMDB keys in shuffled order, for num_epochs epochs.
    key_producer = tf.train.string_input_producer(keys,
                                                  num_epochs=num_epochs,
                                                  shuffle=True)
    single_key = key_producer.dequeue()

    def get_bytes_from_lmdb(key):
        # Plain Python: lmdb_env is captured from the enclosing scope, and
        # `key` arrives as a bytes object, which is what txn.get() expects.
        with lmdb_env.begin() as txn:
            lmdb_val = txn.get(key)
        example = get_example_from_val(lmdb_val)  # a single example (numpy array)
        label = get_label_from_val(lmdb_val)      # the label, could be a scalar
        return example, label

    single_example, single_label = tf.py_func(get_bytes_from_lmdb,
                                              [single_key], [tf.float32, tf.float32])
    # tf.py_func outputs have unknown static shapes, and tf.train.batch
    # needs known shapes -- if you know them, set them here, e.g.:
    # single_example.set_shape([224, 224, 3])

    batch_examples, batch_labels = tf.train.batch([single_example, single_label],
                                                  batch_size)
    return batch_examples, batch_labels

The tf.py_func op inserts a call to regular Python code inside the TensorFlow graph; we need to specify its inputs and the number and types of its outputs. tf.train.string_input_producer creates a shuffled queue of the given keys. The tf.train.batch op creates another queue that holds batches of data; during training, each evaluation of batch_examples or batch_labels dequeues another batch from that queue.
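Note that get_example_from_val and get_label_from_val in the snippet above are placeholders; how to implement them depends entirely on how you serialized the data when building the LMDB file. As a hedged sketch, assuming each LMDB value stores the label as a leading little-endian float32 followed by the flattened float32 example:

```python
import numpy as np

EXAMPLE_SHAPE = (224, 224, 3)  # assumed layout; use whatever shape you stored

def get_label_from_val(lmdb_val):
    # First 4 bytes hold the label as a little-endian float32 (by assumption).
    return np.frombuffer(lmdb_val, dtype='<f4', count=1)[0]

def get_example_from_val(lmdb_val):
    # Remaining bytes hold the flattened example (by assumption).
    flat = np.frombuffer(lmdb_val, dtype='<f4', offset=4)
    return flat.reshape(EXAMPLE_SHAPE)
```

Whatever format you pick, the writer side of your CSV-to-LMDB conversion must produce exactly the byte layout these readers assume.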

Because we created queues, we need to make sure the QueueRunner objects are started before training begins. This is done as follows (adapted from the TensorFlow documentation):

# Create the graph, etc.
# Note: num_epochs in string_input_producer creates a *local* variable,
# so local variables must be initialized as well.
init_op = tf.group(tf.initialize_all_variables(),
                   tf.initialize_local_variables())

# Create a session for running operations in the Graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever
        sess.run(train_op)

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
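One detail left implicit above is where the keys list passed to create_input_pipeline comes from. With the py-lmdb package, a read transaction's cursor can iterate keys without loading the (large) values; a minimal sketch, assuming lmdb_env was opened with something like lmdb.open(path, readonly=True):

```python
def list_lmdb_keys(lmdb_env):
    """Collect all keys from an open LMDB environment.

    string_input_producer needs the full key list up front; iterating with
    values=False avoids reading the values themselves.
    """
    with lmdb_env.begin() as txn:
        return [key for key in txn.cursor().iternext(values=False)]
```

You would call this once before building the graph and pass the result as the `keys` argument.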
answered Sep 30 '22 by Elhanan Ilani