How do I add a new feature column to a tf.data.Dataset object?

I am building an input pipeline for proprietary data using TensorFlow 2.0's data module, with a tf.data.Dataset object holding my features. The data source is a CSV file with only three columns: a label column and two columns of strings that refer to JSON files where the actual data is stored. I have written functions that load all the data I need, and I can use the Dataset's map function on the columns to get that data, but I don't see how to add a new column to my tf.data.Dataset object to hold it. I would appreciate help with the following questions:

How can a new feature be appended to a tf.data.Dataset object?
Should this be done on the entire Dataset before iterating through it, or during iteration? (I think doing it during iteration would take advantage of the pipeline's performance benefits, but I don't know how this functionality works.)

I already have all the methods that take the column elements as input and do everything required to get the features for each element; I just don't understand how to get this data into the dataset. I could use "hacky" workarounds, such as a Pandas DataFrame as a "mediator", but I want to keep everything within the TensorFlow Dataset and pipeline process, for both performance and code quality.

I have looked through the TensorFlow 2.0 documentation for the Dataset class (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset), but haven't been able to find a method that can change the structure of the object.

Here is the function I use to load the original dataset:

def load_dataset(self):
    # TODO: Function to get max number of available CPU threads
    dataset = tf.data.experimental.make_csv_dataset(self.dataset_path,
                                                    self.batch_size,
                                                    label_name='score',
                                                    shuffle_buffer_size=self.get_dataset_size(),
                                                    shuffle_seed=self.seed,
                                                    num_parallel_reads=1)
    return dataset

I also have methods that take a string input (a column element) and return the actual feature data, and I can access the elements of the Dataset with a function like map. But how do I add the result as a column?

asked Aug 07 '19 23:08

deepdreams

1 Answer

Wow, this is embarrassing, but I have found the solution, and its simplicity makes me feel like an idiot for asking. I will leave the answer up in case anyone else ever faces this issue.

You first create a new tf.data.Dataset object using any function that returns a Dataset, such as ".map".
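As a minimal sketch of this first step, the snippet below derives a second dataset from a toy stand-in for the CSV rows. The column name `path_a` and the function `load_feature` are hypothetical; `load_feature` only computes a string length in place of the real JSON loading. Since `map` is applied lazily, the work happens as elements are iterated rather than up front.

```python
import tensorflow as tf

# Toy stand-in for the CSV rows: each element is a dict of string columns.
# The column names here are assumptions, not the asker's real schema.
base = tf.data.Dataset.from_tensor_slices(
    {"path_a": ["a.json", "bb.json"], "path_b": ["c.json", "ddd.json"]}
)

def load_feature(row):
    # Hypothetical placeholder for the real JSON loader:
    # the "feature" is just the length of the path string.
    return tf.strings.length(row["path_a"])

# map returns a new Dataset; the function runs lazily, element by element,
# when the dataset is iterated.
derived = base.map(
    load_feature, num_parallel_calls=tf.data.experimental.AUTOTUNE
)

values = [int(x.numpy()) for x in derived]
```

Passing `num_parallel_calls` lets the pipeline process several elements concurrently, which addresses the "during iteration" part of the question.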

Then you create a new Dataset by zipping the original and the one with the new data:

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
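The zipped dataset yields nested tuples rather than a single feature dictionary, so a follow-up `map` can fold the new data back in as a named column. The sketch below assumes elements shaped like the `(features, label)` pairs that `make_csv_dataset` yields when `label_name` is set; the names `path_a`, `embedding`, `path_len`, `merge`, and `add_column` are made up for illustration.

```python
import tensorflow as tf

# Toy (features, label) dataset with one hypothetical string column.
features = tf.data.Dataset.from_tensor_slices(
    ({"path_a": ["a.json", "bb.json"]}, [0.5, 1.5])
)
# Toy dataset holding the new per-row data, in the same row order.
extra = tf.data.Dataset.from_tensor_slices([[1.0, 2.0], [3.0, 4.0]])

# zip pairs elements by position: ((features, label), new_feature).
zipped = tf.data.Dataset.zip((features, extra))

def merge(feat_and_label, new_feature):
    # Unpack the nested pair and add the new data under its own key.
    feats, label = feat_and_label
    feats = dict(feats)
    feats["embedding"] = new_feature
    return feats, label

merged = zipped.map(merge)

# Possible alternative: skip zip and add the column in a single map.
def add_column(feats, label):
    feats = dict(feats)
    feats["path_len"] = tf.strings.length(feats["path_a"])
    return feats, label

direct = features.map(add_column)
```

Because `zip` matches elements by position, both datasets have to produce rows in the same order; when the new data can be computed from the row itself, the single `map` variant avoids that concern.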

answered Nov 15 '22 08:11

deepdreams

Tags: tensorflow, dataset, tensorflow-datasets
