 

TFRecords occupy more space than original JPEG images

I'm trying to convert my JPEG image set into TFRecords, but the TFRecord file takes almost 5x more space than the image set. After a lot of googling, I learned that when JPEGs are written into TFRecords this way, they aren't JPEGs anymore. However, I haven't come across an understandable code solution to this problem. Please tell me what changes ought to be made in the code below to write JPEGs to TFRecords.

# Imports assumed by the snippet below (not shown in the original question).
import sys
import tensorflow as tf
from matplotlib.pyplot import imread

def print_progress(count, total):
    pct_complete = float(count) / total
    msg = "\r- Progress: {0:.1%}".format(pct_complete)
    sys.stdout.write(msg)
    sys.stdout.flush()

def wrap_int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def wrap_bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def convert(image_paths , labels, out_path):
    # Args:
    # image_paths   List of file-paths for the images.
    # labels        Class-labels for the images.
    # out_path      File-path for the TFRecords output file.

    print("Converting: " + out_path)

    # Number of images. Used when printing the progress.
    num_images = len(image_paths)

    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:

        # Iterate over all the image-paths and class-labels.
        for i, (path, label) in enumerate(zip(image_paths, labels)):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images-1)

            # Load the image-file using matplotlib's imread function.
            img = imread(path)
            # Convert the image to raw bytes.
            img_bytes = img.tostring()

            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes),
                'label': wrap_int64(label)
            }

            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)

            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)

            # Serialize the data.
            serialized = example.SerializeToString()

            # Write the serialized data to the TFRecords file.
            writer.write(serialized)

Edit: Can someone please answer this?

asked Mar 06 '23 by Uchiha Madara

2 Answers

Instead of decoding the image into an array and converting it back to bytes, we can just use the built-in open function to read the file's raw bytes. That way, the compressed JPEG data is written into the TFRecord.

Replace these two lines

img = imread(path)
img_bytes = img.tostring()

with

img_bytes = open(path, 'rb').read()

Reference:

https://github.com/tensorflow/tensorflow/issues/9675
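
Keep in mind that once the record stores compressed JPEG bytes instead of raw pixels, the reading side has to decode them before use. Below is a minimal sketch of how that parsing could look with the same TF 1.x API as the question; the feature names 'image' and 'label' mirror the convert function above, and the file name train.tfrecords is just a placeholder:

def parse_example(serialized):
    # Describe the two features written by convert() above.
    features = {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64)
    }
    parsed = tf.parse_single_example(serialized=serialized, features=features)

    # Decode the compressed JPEG bytes back into a uint8 image tensor.
    image = tf.image.decode_jpeg(parsed['image'], channels=3)
    return image, parsed['label']

dataset = tf.data.TFRecordDataset(['train.tfrecords'])  # placeholder file name
dataset = dataset.map(parse_example)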

answered Mar 11 '23 by Uchiha Madara


You shouldn't save the image data in the TFRecord, just the filename. Then, to load the image when the records are fed into the training loop, you would ideally use the relatively new Dataset API. From the docs:

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
  image_string = tf.read_file(filename)
  image_decoded = tf.image.decode_jpeg(image_string)
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
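
To complete the picture, such a pipeline would typically be continued with shuffling, batching, and an iterator; the buffer and batch sizes below are arbitrary illustrative values, not part of the quoted docs:

# Shuffle, batch, and iterate (TF 1.x style); values are illustrative only.
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()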

Which approach is faster? There are a number of competing factors here, such as:

  • Reading one large, continuous file may be faster than opening and reading many small files. But this will vary across SSDs, spinning disks, and networked storage.
  • Reading many small files may be more amenable to parallelisation.
  • While reading 1000 files of size x may be slower than reading one file of size 1000x, here we are really discussing one large file closer to 10 × 1000x, because the image data is stored as raw pixels, not JPEG (see the rough worked example after this list).
  • But starting from pixel data does save the JPEG decoding step.
  • Optimising read speed probably doesn't make much sense if it is not actually your bottleneck.
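
To put rough numbers on that size difference, here is a back-of-the-envelope sketch; the resolution is purely an illustrative assumption, not taken from the question:

# Rough comparison of raw pixel storage vs. a typical JPEG.
# Assume a 500 x 375 RGB image (arbitrary example resolution).
height, width, channels = 375, 500, 3
raw_bytes = height * width * channels   # 562,500 bytes, roughly 550 KB
# The same image stored as JPEG is often in the 50-150 KB range,
# so writing decoded pixels into the TFRecord can easily inflate it
# several-fold, consistent with the ~5x growth reported in the question.
print("raw size per image: %d bytes" % raw_bytes)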

So, in the end, it's important to know the different approaches. Without measurements, I'd tend towards the many-small-files solution, because it requires less processing of the data we start out with, and because the TensorFlow documentation would be unlikely to use it if it were completely unreasonable. But the only real answer is to measure.

answered Mar 11 '23 by Matthias Winkelmann