I'm trying to convert my JPEG image set into TFRecords, but the TFRecord file takes almost 5x more space than the image set. After a lot of googling, I learned that when JPEGs are written into TFRecords, they aren't JPEGs anymore. However, I haven't come across an understandable code solution to this problem. Please tell me what changes ought to be made in the code below to write JPEGs to TFRecords.
import sys
import tensorflow as tf
from matplotlib.pyplot import imread

def print_progress(count, total):
    pct_complete = float(count) / total
    msg = "\r- Progress: {0:.1%}".format(pct_complete)
    sys.stdout.write(msg)
    sys.stdout.flush()

def wrap_int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def wrap_bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def convert(image_paths, labels, out_path):
    # Args:
    #   image_paths   List of file-paths for the images.
    #   labels        Class-labels for the images.
    #   out_path      File-path for the TFRecords output file.

    print("Converting: " + out_path)

    # Number of images. Used when printing the progress.
    num_images = len(image_paths)

    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths and class-labels.
        for i, (path, label) in enumerate(zip(image_paths, labels)):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images - 1)

            # Load the image-file using matplotlib's imread function.
            img = imread(path)

            # Convert the image to raw bytes.
            img_bytes = img.tostring()

            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes),
                'label': wrap_int64(label)
            }

            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)

            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)

            # Serialize the data.
            serialized = example.SerializeToString()

            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
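To see where the blow-up comes from, you can compare the decoded array size with the JPEG size on disk (a quick sketch; 'image.jpg' is a placeholder path):

import os
from matplotlib.pyplot import imread

path = 'image.jpg'  # placeholder path
img = imread(path)
print("decoded, uncompressed bytes:", img.nbytes)    # height * width * channels
print("JPEG bytes on disk:", os.path.getsize(path))  # typically several times smaller

img.tostring() writes the former into the record, which is roughly the 5x growth I'm seeing.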
Edit: Can someone please answer this?
Instead of converting the image to an array and back to bytes, we can just use the built-in open function to get the bytes. That way, the compressed image is written into the TFRecord.
Replace these two lines
img = imread(path)
img_bytes = img.tostring()
with
img_bytes = open(path,'rb').read()
Reference:
https://github.com/tensorflow/tensorflow/issues/9675
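Note that if you store the encoded JPEG bytes this way, the reading side has to decode them when parsing the records. A minimal parsing sketch, assuming TF 1.x (to match the tf.python_io usage in the question) and the 'image'/'label' feature names used above:

def parse(serialized):
    features = {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64)
    }
    parsed = tf.parse_single_example(serialized=serialized, features=features)
    # Decode the JPEG bytes back into a dense uint8 image tensor.
    image = tf.image.decode_jpeg(parsed['image'], channels=3)
    return image, parsed['label']

dataset = tf.data.TFRecordDataset(filenames=[out_path])
dataset = dataset.map(parse)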
You shouldn't save the image data in the TFRecord, just the filename. Then, to load the image when the records are fed into the training loop, you would ideally use the relatively new Dataset API. From the docs:
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
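From there you would typically shuffle, batch, and iterate; a sketch in the same TF 1.x style (the buffer and batch sizes below are arbitrary choices):

# Shuffle and batch for training; the sizes here are arbitrary.
dataset = dataset.shuffle(buffer_size=1000).batch(32).repeat()

# One-shot iterator (TF 1.x style) yielding (images, labels) batches.
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()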
Which approach is faster? There are a number of competing factors here, such as the filesystem overhead of opening many small files versus reading one large record file sequentially, and the JPEG decoding that has to happen at training time either way.
So, in the end, it's important to know the different approaches. Without measurements, I'd tend towards the many-small-files solution, because it requires less processing of the data we start out with, and because the TensorFlow documentation would be unlikely to use it if it were completely unreasonable. But the only real answer is to measure.
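If you do measure, a crude timing harness over a fixed number of batches is enough to compare the two input pipelines (a sketch, assuming TF 1.x sessions; num_batches is whatever sample size you choose):

import time

def time_batches(dataset, num_batches):
    # Pull a fixed number of batches and report the wall-clock time.
    iterator = dataset.make_one_shot_iterator()
    next_batch = iterator.get_next()
    with tf.Session() as sess:
        start = time.time()
        for _ in range(num_batches):
            sess.run(next_batch)
        return time.time() - start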