Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I add a new feature column to a tf.data.Dataset object?

I am building an input pipeline for proprietary data using Tensorflow 2.0's data module and using the tf.data.Dataset object to store my features. Here is my issue - the data source is a CSV file that has only 3 columns, a label column and then two columns which just hold strings referring to JSON files where that data is stored. I have developed functions that access all the data I need, and am able to use Dataset's map function on the columns to get the data, but I don't see how I can add a new column to my tf.data.Dataset object to hold the new data. So if anyone could help with the following questions, it would really help:

  1. How can a new feature be appended to a tf.data.Dataset object?
  2. Should this process be done on the entire Dataset before iterating through it, or during (I think during iteration would allow utilization of the performance boost, but I don't know how this functionality works)?

I have all the methods for taking the input as the elements from the columns and performing everything required to get the features for each element, I just don't understand how to get this data into the dataset. I could do "hacky" workarounds, using a Pandas Dataframe as a "mediator" or something along those lines, but I want to keep everything within the Tensorflow Dataset and pipeline process, for both performance gains and higher quality code.

I have looked through the Tensorflow 2.0 documentation for the Dataset class (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset), but haven't been able to find a method that can manipulate the structure of the object.

Here is the function I use to load the original dataset:

def load_dataset(self):
    # TODO: Function to get max number of available CPU threads
    dataset = tf.data.experimental.make_csv_dataset(self.dataset_path,
                                                    self.batch_size,
                                                    label_name='score',
                                                    shuffle_buffer_size=self.get_dataset_size(),
                                                    shuffle_seed=self.seed,
                                                    num_parallel_reads=1)
    return dataset

Then, I have methods which allow me to take a string input (column element) and return the actual feature data. And I am able to access the elements from the Dataset using a function like ".map". But how do I add that as a column?

like image 424
deepdreams Avatar asked Aug 07 '19 23:08

deepdreams


People also ask

What is TF feature column?

Think of feature columns as the intermediaries between raw data and Estimators. Feature columns are very rich, enabling you to transform a diverse range of raw data into formats that Estimators can use, allowing easy experimentation. In simple words feature column are bridge between raw data and estimator or model.

What is TF data dataset?

TensorFlow Datasets is a collection of datasets ready to use, with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf. data. Datasets , enabling easy-to-use and high-performance input pipelines. To get started see the guide and our list of datasets.

What does TF data dataset from_tensor_slices do?

With that knowledge, from_tensors makes a dataset where each input tensor is like a row of your dataset, and from_tensor_slices makes a dataset where each input tensor is column of your data; so in the latter case all tensors must be the same length, and the elements (rows) of the resulting dataset are tuples with one ...

How do I create a dataset in TF?

There are two distinct ways to create a dataset: A data source constructs a Dataset from data stored in memory or in one or more files. A data transformation constructs a dataset from one or more tf.data.Dataset objects. To create an input pipeline, you must start with a data source.

How do I create a dataset in TensorFlow?

There are two distinct ways to create a dataset: A data source constructs a Dataset from data stored in memory or in one or more files. A data transformation constructs a dataset from one or more tf.data.Dataset objects. import tensorflow as tf

Is the TF feature_columns module recommended for new code?

Warning: The tf.feature_columns module described in this tutorial is not recommended for new code. Keras preprocessing layers cover this functionality, for migration instructions see the Migrating feature columns guide. The tf.feature_columns module was designed for use with TF1 Estimators.

How to add a new column in pandas Dataframe?

We can use a Python dictionary to add a new column in pandas DataFrame. Use an existing column as the key values and their respective values will be the values for new column. import pandas as pd data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],


1 Answers

Wow, this is embarassing, but I have found the solution and it's simplicity literally makes me feel like an idiot for asking this. But I will leave the answer up just in case anyone else is ever facing this issue.

You first create a new tf.data.Dataset object using any function that returns a Dataset, such as ".map".

Then you create a new Dataset by zipping the original and the one with the new data:

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
like image 98
deepdreams Avatar answered Nov 15 '22 08:11

deepdreams