TL;DR: my question is how to load compressed video frames from TFRecords.
I am setting up a data pipeline for training deep learning models on a large video dataset (Kinetics). For this I am using TensorFlow, more specifically the tf.data.Dataset and TFRecordDataset structures. As the dataset contains ~300k videos of 10 seconds each, there is a large amount of data to deal with. During training I want to randomly sample 64 consecutive frames from a video, therefore fast random sampling is important. To achieve this, a number of data loading scenarios are possible during training:

- Read the videos from storage using ffmpeg or OpenCV and sample frames on the fly. Not ideal, as seeking in videos is tricky and decoding video streams is much slower than decoding JPG.
- Preprocess the dataset into TFRecords or HDF5 files. Requires more work getting the pipeline ready, but most likely to be the fastest of those options.

I have decided to go for the latter and use TFRecord files to store a preprocessed version of the dataset. However, this is also not as straightforward as it seems.
I have written the following code to preprocess the video dataset and write the video frames to TFRecord files (each ~5GB in size):
def _int64_feature(value):
    """Wrapper for inserting int64 features into Example proto."""
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _bytes_feature(value):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
with tf.python_io.TFRecordWriter(output_file) as writer:
    # Read and resize all video frames, np.uint8 of size [N,H,W,3]
    frames = ...

    features = {}
    features['num_frames'] = _int64_feature(frames.shape[0])
    features['height'] = _int64_feature(frames.shape[1])
    features['width'] = _int64_feature(frames.shape[2])
    features['channels'] = _int64_feature(frames.shape[3])
    features['class_label'] = _int64_feature(example['class_id'])
    features['class_text'] = _bytes_feature(tf.compat.as_bytes(example['class_label']))
    features['filename'] = _bytes_feature(tf.compat.as_bytes(example['video_id']))

    # Compress the frames using JPG and store as bytes in:
    # 'frames/0001', 'frames/0002', ...
    for i in range(len(frames)):
        ret, buffer = cv2.imencode(".jpg", frames[i])
        features["frames/{:04d}".format(i)] = _bytes_feature(tf.compat.as_bytes(buffer.tobytes()))

    tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tfrecord_example.SerializeToString())
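For reference, one way the frames array above could be produced is by reading and resizing the video with OpenCV. This is only a minimal sketch; the helper name read_video_frames and the 224x224 resize target are assumptions, not part of the original pipeline:

import cv2
import numpy as np

def read_video_frames(video_path, size=(224, 224)):
    """Read all frames from a video file and resize them.
    Returns an np.uint8 array of shape [N, H, W, 3] in BGR order,
    which is the layout cv2.imencode expects."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)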
This works fine; the dataset is nicely written as TFRecord files with the frames as compressed JPG bytes. My question is how to read the TFRecord files during training, randomly sample 64 consecutive frames from a video, and decode the JPG images.
According to TensorFlow's documentation on tf.data, we need to do something like:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
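For ordinary images, the elided dataset.map(...) step above typically uses a small parse function along these lines. This is a minimal sketch; the feature keys "image/encoded" and "label" are assumptions for illustration only:

def _parse_image(serialized_example):
    # Describe the expected features of a single-image Example proto.
    features = {
        "image/encoded": tf.FixedLenFeature((), tf.string),
        "label": tf.FixedLenFeature((), tf.int64),
    }
    parsed = tf.parse_single_example(serialized_example, features)
    # Decode the JPG bytes into a [H, W, 3] uint8 tensor.
    image = tf.image.decode_jpeg(parsed["image/encoded"])
    return image, parsed["label"]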
There are many examples of how to do this with images, and that is quite straightforward. However, for video and random sampling of frames I am stuck. The tf.train.Features object stores the frames as frames/0001, frames/0002, etc. My first question is: how do I randomly sample a set of consecutive frames from this inside the dataset.map() function? A further consideration is that each frame has a variable number of bytes due to JPG compression and needs to be decoded using tf.image.decode_jpeg.
Any help on how to best set up reading video samples from TFRecord files would be appreciated!
Encoding each frame as a separate feature makes it difficult to select frames dynamically, because the signature of tf.parse_example() (and tf.parse_single_example()) requires that the set of parsed feature names be fixed at graph construction time. However, you could try encoding the frames as a single feature that contains a list of JPEG-encoded strings:
def _bytes_list_feature(values):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
with tf.python_io.TFRecordWriter(output_file) as writer:
    # Read and resize all video frames, np.uint8 of size [N,H,W,3]
    frames = ...

    features = {}
    features['num_frames'] = _int64_feature(frames.shape[0])
    features['height'] = _int64_feature(frames.shape[1])
    features['width'] = _int64_feature(frames.shape[2])
    features['channels'] = _int64_feature(frames.shape[3])
    features['class_label'] = _int64_feature(example['class_id'])
    features['class_text'] = _bytes_feature(tf.compat.as_bytes(example['class_label']))
    features['filename'] = _bytes_feature(tf.compat.as_bytes(example['video_id']))

    # Compress the frames using JPG and store them as a list of strings in 'frames'
    encoded_frames = [tf.compat.as_bytes(cv2.imencode(".jpg", frame)[1].tobytes())
                      for frame in frames]
    features['frames'] = _bytes_list_feature(encoded_frames)

    tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tfrecord_example.SerializeToString())
Once you have done this, it will be possible to slice the frames feature dynamically, using a modified version of your parsing code:
def decode(serialized_example, sess):
    # Prepare feature list; read encoded JPG images as bytes
    features = dict()
    features["class_label"] = tf.FixedLenFeature((), tf.int64)
    features["frames"] = tf.VarLenFeature(tf.string)
    features["num_frames"] = tf.FixedLenFeature((), tf.int64)

    # Parse into tensors
    parsed_features = tf.parse_single_example(serialized_example, features)

    # Randomly sample an offset from the valid range
    # [0, num_frames - SEQ_NUM_FRAMES]; maxval is exclusive, hence the + 1.
    random_offset = tf.random_uniform(
        shape=(), minval=0,
        maxval=parsed_features["num_frames"] - SEQ_NUM_FRAMES + 1, dtype=tf.int64)
    offsets = tf.range(random_offset, random_offset + SEQ_NUM_FRAMES)

    # Decode the encoded JPG images; dtype is needed because the output
    # type (uint8 images) differs from the int64 offsets.
    images = tf.map_fn(lambda i: tf.image.decode_jpeg(parsed_features["frames"].values[i]),
                       offsets, dtype=tf.uint8)

    label = tf.cast(parsed_features["class_label"], tf.int64)
    return images, label
(Note that I haven't been able to run your code, so there may be some small errors, but hopefully it is enough to get you started.)
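For completeness, here is a minimal sketch of how this decode function could be wired into the tf.data pipeline from the documentation snippet above. The shuffle buffer size, the batch size of 8, and the TFRecord path are assumptions; since decode takes an unused sess argument, it is wrapped in a lambda for dataset.map():

import tensorflow as tf

SEQ_NUM_FRAMES = 64  # number of consecutive frames to sample per video

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
# Parse each video record and sample SEQ_NUM_FRAMES consecutive frames from it.
dataset = dataset.map(lambda example: decode(example, None))
dataset = dataset.shuffle(buffer_size=64)   # assumed buffer size
dataset = dataset.repeat()
dataset = dataset.batch(8)                  # assumed batch size
iterator = dataset.make_initializable_iterator()
frames_batch, label_batch = iterator.get_next()  # [8, 64, H, W, 3], [8]

with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={filenames: ["/var/data/file1.tfrecord"]})
    frames_np, labels_np = sess.run([frames_batch, label_batch])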