 

How do I change the dtype in TensorFlow for a csv file?

Here is the code that I am trying to run-

import tensorflow as tf
import numpy as np
import input_data

filename_queue = tf.train.string_input_producer(["cs-training.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

record_defaults = [[1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.concat(0, [col2, col3, col4, col5, col6, col7, col8, col9, col10, col11])

with tf.Session() as sess:
  # Start populating the filename queue.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(1200):
    # Retrieve a single instance:
    print i
    example, label = sess.run([features, col1])
    try:
        print example, label
    except:
        pass

  coord.request_stop()
  coord.join(threads)

This code returns the error below.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-23-e42fe2609a15> in <module>()
      7     # Retrieve a single instance:
      8     print i
----> 9     example, label = sess.run([features, col1])
     10     try:
     11         print example, label

/root/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict)
    343 
    344     # Run request and get response.
--> 345     results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
    346 
    347     # User may have fetched the same tensor multiple times, but we

/root/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _do_run(self, target_list, fetch_list, feed_dict)
    417         # pylint: disable=protected-access
    418         raise errors._make_specific_exception(node_def, op, e.error_message,
--> 419                                               e.code)
    420         # pylint: enable=protected-access
    421       raise e_type, e_value, e_traceback

InvalidArgumentError: Field 1 in record 0 is not a valid int32: 0.766126609

It has a lot of additional output following it, which I think is irrelevant to the problem. Obviously the problem is that much of the data I am feeding the program is not of dtype int32; it's mostly floating-point numbers. I've tried a few things to change the dtype, like explicitly setting the dtype=float argument in tf.decode_csv as well as in tf.concat. Neither worked; it's an invalid argument. On top of that, I don't know whether this code will actually make a prediction on the data. I want it to predict whether col1 is going to be a 1 or a 0, and I don't see anything in the code that would actually make that prediction. Maybe I'll save that question for a different thread. Any help is greatly appreciated!
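For what it's worth, the failure reproduces outside TensorFlow too: parsing a float string as an integer fails the same way in plain Python. The line below is made up (it just has the same shape as my data, with the float value from the error message in column 2):

```python
# Parsing a CSV line where every field is assumed to be an integer
# fails as soon as a floating-point value appears -- the same
# mismatch that produces the InvalidArgumentError above.
line = "1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2"

try:
    fields = [int(f) for f in line.split(",")]
except ValueError as e:
    print("int parse failed:", e)   # "0.766126609" is not a valid int

# Parsing the same fields as floats succeeds:
fields = [float(f) for f in line.split(",")]
print(fields[1])                    # 0.766126609
```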

Ravaal asked Nov 19 '15



2 Answers

The interface to tf.decode_csv() is a little tricky. The dtype of each column is determined by the corresponding element of the record_defaults argument. The value for record_defaults in your code is interpreted as each column having tf.int32 as its type, which leads to an error when it encounters floating-point data.

Let's say you have the following CSV data, containing three integer columns, followed by a floating point column:

4, 8, 9, 4.5
2, 5, 1, 3.7
2, 2, 2, 0.1

Assuming all of the columns are required, you would build record_defaults as follows:

value = ...

record_defaults = [tf.constant([], dtype=tf.int32),    # Column 0
                   tf.constant([], dtype=tf.int32),    # Column 1
                   tf.constant([], dtype=tf.int32),    # Column 2
                   tf.constant([], dtype=tf.float32)]  # Column 3

col0, col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults)

assert col0.dtype == tf.int32
assert col1.dtype == tf.int32
assert col2.dtype == tf.int32
assert col3.dtype == tf.float32

An empty default in record_defaults signifies that a value in that column is required. Alternatively, if (for example) column 2 is allowed to have missing values, you would define record_defaults as follows:

record_defaults = [tf.constant([], dtype=tf.int32),     # Column 0
                   tf.constant([], dtype=tf.int32),     # Column 1
                   tf.constant([0], dtype=tf.int32),    # Column 2
                   tf.constant([], dtype=tf.float32)]   # Column 3
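The required-vs-optional logic can be sketched in plain Python (illustrative only; tf.decode_csv implements this internally, and decode_field below is a made-up helper, not part of the TensorFlow API):

```python
# Illustrative re-implementation of record_defaults semantics:
# an empty default means the field is required; a one-element
# default supplies a fallback for missing fields.
def decode_field(raw, default, dtype):
    if raw != "":
        return dtype(raw)
    if not default:                      # empty default -> required
        raise ValueError("missing required field")
    return default[0]                    # use the supplied default

# Column 2 is optional with default 0; the rest are required.
defaults = [[], [], [0], []]
dtypes = [int, int, int, float]

row = "4,8,,4.5".split(",")              # column 2 is missing
decoded = [decode_field(r, d, t)
           for r, d, t in zip(row, defaults, dtypes)]
print(decoded)                           # [4, 8, 0, 4.5]
```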

The second part of your question concerns how to build and train a model that predicts the value of one of the columns from the input data. Currently, the program doesn't: it simply concatenates the columns into a single tensor called features. You will need to define and train a model that interprets that data. One of the simplest such approaches is linear regression, and you might find the TensorFlow tutorial on linear regression adaptable to your problem.
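To make the shape of that modeling step concrete, here is a toy sketch in plain Python (no TensorFlow): logistic regression trained by gradient descent to predict a 0/1 label from a feature, which is the kind of model you would put after the decoding code. The data below is synthetic; in practice you would substitute the decoded CSV columns.

```python
import math
import random

# Synthetic data: a bias term plus one feature, labeled 1 when the
# feature exceeds 0.5 -- a stand-in for predicting col1 from features.
random.seed(0)
data = [([1.0, x], 1 if x > 0.5 else 0)
        for x in [random.random() for _ in range(200)]]

w = [0.0, 0.0]                   # weights: bias + one feature
for _ in range(2000):            # gradient-descent epochs
    for features, label in data:
        z = sum(wi * xi for wi, xi in zip(w, features))
        pred = 1.0 / (1.0 + math.exp(-z))         # sigmoid
        for i, xi in enumerate(features):
            w[i] -= 0.1 * (pred - label) * xi     # gradient step

def predict(features):
    z = sum(wi * xi for wi, xi in zip(w, features))
    return 1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0

accuracy = sum(predict(f) == y for f, y in data) / len(data)
print("train accuracy:", accuracy)
```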

mrry answered Oct 21 '22


The answer to changing the dtype is simply to change the defaults, like so:

record_defaults = [[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]]

After you do that, if you print out col1, you'll see this:

Tensor("DecodeCSV_43:0", shape=TensorShape([]), dtype=float32)

But there is another error you will run into, which has been answered here. To recap that answer, the workaround is to change tf.concat to tf.pack, like so:

features = tf.pack([col2, col3, col4, col5, col6, col7, col8, col9, col10, col11])
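The reason tf.pack works where tf.concat fails is that each decoded column is a scalar (rank-0) tensor: concat joins tensors along an existing dimension, while pack creates a new one. The same distinction exists in NumPy, which shows the failure mode without TensorFlow:

```python
import numpy as np

# Each decoded CSV column is effectively a scalar (0-d) value.
# Concatenating scalars fails because there is no axis to join on:
a, b = np.float32(1.5), np.float32(2.5)
try:
    np.concatenate([a, b])
except ValueError as e:
    print("concatenate failed:", e)

# Stacking creates a new axis, producing a length-2 vector --
# the NumPy analogue of replacing tf.concat with tf.pack:
v = np.stack([a, b])
print(v, v.shape)        # [1.5 2.5] (2,)
```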

Ravaal answered Oct 22 '22