How to create dataset similar to cifar-10 [closed]

I want to create a dataset that has the same format as the cifar-10 data set to use with TensorFlow. It should have images and labels. I'd like to be able to take the cifar-10 code, swap in different images and labels, and run it.

BlackyTheCat asked Jan 27 '16 08:01



2 Answers

First, we need to understand the format of the CIFAR10 data set. If we refer to https://www.cs.toronto.edu/~kriz/cifar.html, and specifically the Binary Version section, we see:

the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
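
As a rough illustration of the layout quoted above, here is a minimal sketch that decodes one record with NumPy. The helper name decode_record and the 32x32 defaults are illustrative assumptions; this is not part of the CIFAR-10 code itself.

```python
import numpy as np


def decode_record(record, height=32, width=32):
    """Split one binary record into (label, height x width x 3 image).

    decode_record is a hypothetical helper, for illustration only.
    """
    data = np.frombuffer(record, dtype=np.uint8)
    expected = 1 + 3 * height * width
    if data.size != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, data.size))
    # The first byte is the label.
    label = int(data[0])
    # The rest is stored channel by channel (red, green, blue),
    # each channel in row-major order.
    planes = data[1:].reshape(3, height, width)
    # Move the channel axis last: (channel, row, col) -> (row, col, channel).
    image = planes.transpose(1, 2, 0)
    return label, image
```

Reading a record back this way is a cheap check that a file you wrote yourself really follows the label-then-R-G-B ordering.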

So we need to store our data in this format. As a baseline experiment, first get images that have exactly the same size and number of classes as CIFAR10 and put them in this format. This means your images should have a size of 32x32x3 and fall into 10 classes. If that runs successfully, you can go on to handle cases like single-channel images, different input sizes, and different numbers of classes. Doing so means changing many variables in other parts of the code, so work your way through it slowly.

I'm in the midst of working out a general module. My code for this is at https://github.com/jkschin/svhn. If you refer to svhn_flags.py, you will see many flags that can be changed to accommodate your needs. I admit it's cryptic for now, as I haven't cleaned it up to be readable, but it works. If you are willing to spend some time looking through it, you should be able to figure it out.

This is probably the easiest way to run your own data set with the CIFAR10 code. You could of course just copy the neural network definition and implement your own reader, input format, batching, etc., but if you want it up and running fast, just make your inputs fit CIFAR10.

EDIT:

Some very basic code that I hope will help.

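from PIL import Image
import numpy as np

# Load the image and force 3-channel RGB so the channel indexing below works
# even for grayscale or RGBA files.
im = Image.open('images.jpeg').convert('RGB')
im = np.array(im, dtype=np.uint8)  # shape: (height, width, 3)

# Flatten each channel in row-major order.
r = im[:, :, 0].flatten()
g = im[:, :, 1].flatten()
b = im[:, :, 2].flatten()
label = [1]  # class index of this image

# One record: the label byte, then all red, all green, all blue values.
out = np.array(list(label) + list(r) + list(g) + list(b), np.uint8)
out.tofile("out.bin")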

This converts one image into a binary record in the CIFAR10 layout. For multiple images, just keep concatenating the records, as described in the format above. To check that your format is correct, specifically for the asker's use case, a single-image file should be 427 × 427 × 3 + 1 = 546988 bytes, assuming your pictures are RGB and the values range from 0 to 255. Once you have verified that, you're all set to run it in TensorFlow. It is also worth using TensorBoard to visualize one image, just to confirm correctness.
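
To make the "keep concatenating" step and the size check concrete, here is a minimal sketch. The helper names to_record, write_batch and expected_size are made up for this example, and it assumes every image is already a height x width x 3 uint8 array.

```python
import os

import numpy as np


def to_record(label, image):
    """Return one record: label byte, then red, green and blue planes."""
    image = np.asarray(image, dtype=np.uint8)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("image must have shape (height, width, 3)")
    # Channels first, so ravel() emits all red values row by row,
    # then all green values, then all blue values.
    planes = image.transpose(2, 0, 1).ravel()
    return np.concatenate((np.array([label], dtype=np.uint8), planes))


def write_batch(path, labels, images):
    """Concatenate the records of all images and write them to one file."""
    records = [to_record(label, image) for label, image in zip(labels, images)]
    np.concatenate(records).tofile(path)


def expected_size(num_images, height, width):
    """File size in bytes for num_images RGB images of the given size."""
    return num_images * (1 + height * width * 3)
```

After writing a file, comparing os.path.getsize(path) with expected_size(...) catches most layout mistakes, for example a stray alpha channel or a missing label byte.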

EDIT 2:

As per Asker's question in comments,

if not eval_data:
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i)
                 for i in xrange(1, 6)]

If you really want it to work as is, you need to study the function calls in the CIFAR10 code. In cifar10_input, the batch file names are hardcoded, so you have to edit this line of code to match the name of your bin file. Alternatively, just distribute your images evenly across 6 bin files.
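
The second option, spreading the records across 6 files, could look roughly like the sketch below. The function name split_into_batches is made up for this example, the training file names follow the hardcoded pattern shown above, and test_batch.bin is an assumption about the name the evaluation code expects.

```python
import os

RECORD_BYTES = 1 + 32 * 32 * 3  # one label byte plus 3072 pixel bytes


def split_into_batches(records, out_dir, record_bytes=RECORD_BYTES):
    """Spread whole records over 5 training files and 1 test file.

    records is a bytes object holding concatenated records.
    'test_batch.bin' is an assumed name, not taken from the answer.
    """
    if len(records) % record_bytes != 0:
        raise ValueError("input is not a whole number of records")
    names = ["data_batch_%d.bin" % i for i in range(1, 6)]
    names.append("test_batch.bin")
    total = len(records) // record_bytes
    per_file, extra = divmod(total, len(names))
    paths = []
    start = 0
    for index, name in enumerate(names):
        # The first `extra` files take one additional record each.
        count = per_file + (1 if index < extra else 0)
        end = start + count * record_bytes
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            handle.write(records[start:end])
        paths.append(path)
        start = end
    return paths
```

It is probably a good idea to shuffle the records before splitting, so that each file holds a mix of classes.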

jkschin answered Sep 21 '22 03:09


None of the answers did what I wanted, so I made my own solution. It can be found on my GitHub here: https://github.com/jdeepee/machine_learning/tree/master

This script converts any number of images into training and test data whose arrays have the same shape as the cifar10 dataset.

The code is commented, so it should be easy enough to follow. Note that it iterates through a master directory containing multiple folders, which in turn contain the images.
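
The linked script is not reproduced here. As a rough sketch of the folder-walking idea only, the hypothetical helper below assumes each subfolder of the master directory holds one class and assigns label indices by sorted folder name; it is not the code from the repository.

```python
import os


def list_labeled_images(master_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (class_names, [(image_path, label_index), ...]).

    Assumes one subfolder per class; labels follow sorted folder names.
    """
    class_names = sorted(
        name for name in os.listdir(master_dir)
        if os.path.isdir(os.path.join(master_dir, name))
    )
    samples = []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(master_dir, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            # Skip anything that does not look like an image file.
            if file_name.lower().endswith(extensions):
                samples.append((os.path.join(class_dir, file_name), label))
    return class_names, samples
```

The resulting (path, label) pairs can then be loaded, resized and written out in whatever array or binary layout the training code expects.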

Joshua answered Sep 19 '22 03:09