How to convert string labels to one-hot vectors in TensorFlow?

I'm new to TensorFlow and would like to read a comma-separated values (CSV) file containing two columns: column 1 is the index and column 2 is a label string. The code below reads the CSV file line by line, and print statements confirm that the data is read correctly. However, I would like to convert the string labels to one-hot vectors and do not know how to do that in TensorFlow. The final goal is to use the tf.train.batch() function to get batches of one-hot label vectors for training a neural network.

As you can see in the code below, I can create a one-hot vector for each of the label entries manually within a TensorFlow session. But how do I use the tf.train.batch() function? If I move the line

label_batch = tf.train.batch([col2], batch_size=5)

into the TensorFlow session block (replacing col2 with label_one_hot), the program hangs and does nothing. I also tried moving the one-hot conversion outside the TensorFlow session, but could not get it to work correctly. What is the correct way to do it? Please help.

label_files = []
label_files.append(LABEL_FILE)
print "label_files: ", label_files

filename_queue = tf.train.string_input_producer(label_files)

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
print "key:", key, ", value:", value

record_defaults = [['default_id'], ['default_label']]
col1, col2 = tf.decode_csv(value, record_defaults=record_defaults)

num_lines = sum(1 for line in open(LABEL_FILE))

label_batch = tf.train.batch([col2], batch_size=5)

with tf.Session() as sess:
    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coordinator)

    for i in range(100):
        column1, column2 = sess.run([col1, col2])

        index = 0
        if column2 == 'airplane':
            index = 0
        elif column2 == 'automobile':
            index = 1
        elif column2 == 'bird':
            index = 2
        elif column2 == 'cat':
            index = 3
        elif column2 == 'deer':
            index = 4
        elif column2 == 'dog':
            index = 5
        elif column2 == 'frog':
            index = 6
        elif column2 == 'horse':
            index = 7
        elif column2 == 'ship':
            index = 8
        elif column2 == 'truck':
            index = 9

        label_one_hot = tf.one_hot([index], 10)  # depth=10 for 10 categories
        print "column1:", column1, ", column2:", column2
        # print "onehot label:", sess.run([label_one_hot])

    print sess.run(label_batch)

    coordinator.request_stop()
    coordinator.join(threads)
so_user asked Apr 25 '17 06:04


2 Answers

More than two years have passed since this question was asked, but this answer might still be relevant for some. Here's one simple way to transform string labels into one-hot vectors in TensorFlow:

import tensorflow as tf

vocab = ['a', 'b', 'c']

labels = tf.placeholder(dtype=tf.string, shape=(None,))
# Compare the labels with each vocabulary entry; stacking gives shape (batch, len(vocab)).
matches = tf.stack([tf.equal(labels, s) for s in vocab], axis=-1)
onehot = tf.cast(matches, tf.float32)

with tf.Session() as sess:
    out = sess.run(onehot, feed_dict={labels: ['c', 'a']})
    print(out) # prints [[0. 0. 1.]
               #         [1. 0. 0.]]
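
The comparison-against-a-vocabulary idea can also be checked without a TensorFlow session. Below is a minimal sketch in plain Python, assuming the ten class names from the question's if/elif chain as the vocabulary; the helper name one_hot_labels is made up for illustration and is not part of any library:

```python
# Class names taken from the question's if/elif chain, in the same order.
CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']


def one_hot_labels(labels, vocab):
    """Hypothetical helper: compare every label against every vocabulary entry.

    Returns one row per label, with 1.0 where the label equals the
    vocabulary entry and 0.0 elsewhere.
    """
    return [[1.0 if label == word else 0.0 for word in vocab]
            for label in labels]


rows = one_hot_labels(['cat', 'airplane'], CLASSES)
for row in rows:
    print(row)
```

As with the tf.equal version, a label that is not in the vocabulary produces an all-zero row rather than an error, so it may be worth validating labels before training.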
rvinas answered Nov 17 '22 22:11


You could feed your index variable into a placeholder, which is then transformed into a one-hot vector via tf.one_hot. Something along these lines:

# index must be a list or array of YOUR_BATCH_SIZE integer class ids
lbl = tf.placeholder(tf.uint8, [YOUR_BATCH_SIZE])
lbl_one_hot = tf.one_hot(lbl, YOUR_VOCAB_SIZE, 1.0, 0.0)
lb_h = sess.run([lbl_one_hot], feed_dict={lbl: index})

I'm not sure whether you are processing data in batches; if not, YOUR_BATCH_SIZE might be irrelevant in your case. You can also do this with numpy.zeros, but I find the above cleaner and easier, especially with batching.
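
For comparison, the numpy.zeros route could look roughly like this. It is only a sketch: the function name one_hot_batch and the sample indices are made up for illustration, and it assumes every index lies in the range [0, depth):

```python
import numpy as np


def one_hot_batch(indices, depth):
    """Hypothetical helper: build a (len(indices), depth) float32 one-hot array."""
    indices = np.asarray(indices, dtype=np.int64)
    out = np.zeros((indices.size, depth), dtype=np.float32)
    # Set exactly one entry per row: row i gets a 1.0 in column indices[i].
    out[np.arange(indices.size), indices] = 1.0
    return out


batch = one_hot_batch([2, 0, 9], depth=10)
print(batch.shape)
```

The resulting array can be fed to a float32 placeholder directly, at the cost of doing the conversion in Python instead of inside the graph.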

VS_FF answered Nov 17 '22 23:11