I have a large (1 TB) set of data split over about 3,000 CSV files. My plan is to convert it to one large LMDB file so it can be read quickly for training a neural network. However, I have not been able to find any documentation on how to load an LMDB file into TensorFlow. Does anyone know how to do this? I know TensorFlow can read CSV files, but I believe that would be too slow.
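For reference, here is a minimal sketch of the conversion step I have in mind, using the lmdb Python package (the paths, key scheme, and value encoding here are just placeholders, not a fixed design):

import csv
import glob
import lmdb

# Write every CSV row into one LMDB file, keyed by a running counter.
env = lmdb.open('data.lmdb', map_size=2 * 1024 ** 4)  # map_size must exceed the data size
i = 0
for path in glob.glob('csv_dir/*.csv'):
    with env.begin(write=True) as txn:  # one write transaction per file
        with open(path) as f:
            for row in csv.reader(f):
                txn.put(str(i).encode(), ','.join(row).encode())
                i += 1
env.close()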
According to the TensorFlow documentation on reading data, there are several ways to read data in TensorFlow.
The simplest one is to feed your data through placeholders. When using placeholders, the responsibility for shuffling and batching is on you.
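For example, here is a minimal sketch of the placeholder approach (the toy data and the trivial model are placeholders for your own), where shuffling and batching are done by hand with NumPy:

import numpy as np
import tensorflow as tf

# Toy data standing in for examples loaded from disk.
train_examples = np.random.rand(1000, 10).astype(np.float32)
train_labels = np.random.rand(1000).astype(np.float32)

x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None])
pred = tf.reduce_sum(x * tf.Variable(tf.zeros([10])), axis=1)  # trivial linear model
loss = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    perm = np.random.permutation(len(train_examples))  # shuffling is on you
    for start in range(0, len(perm), 64):               # batching is on you
        idx = perm[start:start + 64]
        sess.run(train_op, feed_dict={x: train_examples[idx],
                                      y: train_labels[idx]})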
If you want to delegate shuffling and batching to the framework, then you need to create an input pipeline. The problem is how to inject LMDB data into the symbolic input pipeline. A possible solution is to use the tf.py_func operation. Here is an example:
import tensorflow as tf

def create_input_pipeline(lmdb_env, keys, num_epochs=10, batch_size=64):
    # A queue that yields the LMDB keys in shuffled order.
    key_producer = tf.train.string_input_producer(keys,
                                                  num_epochs=num_epochs,
                                                  shuffle=True)
    single_key = key_producer.dequeue()

    def get_bytes_from_lmdb(key):
        # Plain Python: look the key up in LMDB and decode the value.
        with lmdb_env.begin() as txn:
            lmdb_val = txn.get(key)
        example = get_example_from_val(lmdb_val)  # a single example (NumPy array); decoding is up to you
        label = get_label_from_val(lmdb_val)      # the label, could be a scalar
        return example, label

    single_example, single_label = tf.py_func(get_bytes_from_lmdb,
                                              [single_key],
                                              [tf.float32, tf.float32])
    # tf.train.batch needs fully defined shapes, so set them here if you know them:
    # single_example.set_shape([224, 224, 3])
    # single_label.set_shape([])
    batch_examples, batch_labels = tf.train.batch([single_example, single_label],
                                                  batch_size)
    return batch_examples, batch_labels
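A minimal sketch of how this might be wired up (the database path and the key listing are assumptions; tf.py_func hands the key to get_bytes_from_lmdb as bytes, which is what txn.get expects):

import lmdb

lmdb_env = lmdb.open('data.lmdb', readonly=True)
with lmdb_env.begin() as txn:
    keys = [key for key, _ in txn.cursor()]  # every key in the database

batch_examples, batch_labels = create_input_pipeline(lmdb_env, keys)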
The tf.py_func op inserts a call to regular Python code inside the TensorFlow graph; we need to specify the inputs and the number and types of the outputs. The tf.train.string_input_producer op creates a shuffled queue with the given keys. The tf.train.batch op creates another queue that contains batches of data. When training, each evaluation of batch_examples or batch_labels will dequeue another batch from that queue.
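To make the tf.py_func mechanics concrete, here is a tiny self-contained example (not from the original answer) that wraps a plain NumPy function:

import numpy as np
import tensorflow as tf

def square(x):
    return np.square(x)  # ordinary Python/NumPy code

inp = tf.constant([1.0, 2.0, 3.0])
out = tf.py_func(square, [inp], tf.float32)  # inputs, then the output type(s)

with tf.Session() as sess:
    print(sess.run(out))  # [1. 4. 9.]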
Because we created queues, we need to make sure the QueueRunner objects are started before we begin training. This is done like this (from the TensorFlow documentation):
# Create the graph, etc.
# Note: string_input_producer's num_epochs counter is a local variable,
# so local variables must be initialized as well.
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

# Create a session for running operations in the Graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever
        sess.run(train_op)
except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
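For completeness, here is a hedged sketch of how the pipeline output could define the train_op used above (my_model and the loss are placeholders, not part of the original answer):

batch_examples, batch_labels = create_input_pipeline(lmdb_env, keys)
logits = my_model(batch_examples)  # hypothetical model-building function
loss = tf.reduce_mean(tf.square(logits - batch_labels))  # placeholder loss
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

Note that no feed_dict is needed: each sess.run(train_op) pulls a fresh batch from the queue.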