I am training my TensorFlow model on data from a CSV file preprocessed with tf.data.Dataset. However, I want the model to fork into three branches, each corresponding to a different set of CSV columns, and model.fit requires a separate dataset for each output. All columns of the CSV file need to undergo the same preprocessing, so the most efficient way of preparing it would be to load the whole file, process it, and then split the dataset into three parts. However, I am struggling to find a way of doing so.
I hoped that dataset.map would allow me to select some columns using the following operation:
dset = dset.map(lambda x: x[[1, 2, 3, 7]])
but it seems that TensorFlow interprets it as x[1][2][3][7] instead.
The only working way of creating separate datasets that I've found was to do it from the beginning:
y = []
for cls, keys in output_classes.items():
    tmp = tf.data.experimental.CsvDataset(data_path, [tf.int32 for i in keys], select_cols=keys)
    [...]
    y.append(tmp)
y = tf.data.Dataset.zip(tuple(y))
Unfortunately, it produces a lot of unnecessary overhead and slows down training immensely.
Is there a way of splitting tf.data.Dataset object by a subset of features?
To achieve what you want, you can first transpose the tensor t from which you want to select certain columns, then look up the rows of tf.transpose(t) (which are the columns of t), and finally transpose the result back.
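A minimal sketch of that transpose trick, using a small placeholder tensor and the column indices from the question:

import tensorflow as tf

t = tf.reshape(tf.range(2 * 8), (2, 8))          # placeholder tensor, shape (2, 8)
cols = tf.gather(tf.transpose(t), [1, 2, 3, 7])  # rows of tf.transpose(t) are columns of t
selected = tf.transpose(cols)                    # transpose back, shape (2, 4)
print(selected.numpy())
# [[ 1  2  3  7]
#  [ 9 10 11 15]]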
Try tf.gather:
tf.gather(tf.constant([1, 2, 3, 4]), [1, 2, 3])
# outputs: array([2, 3, 4])
If you have high-dimensional data, use tf.gather_nd instead.
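For reference, a small tf.gather_nd sketch (the tensor and index pairs are just illustrative): each inner index list addresses one element, so you can pick arbitrary positions rather than whole rows or columns.

import tensorflow as tf

params = tf.constant([[1, 2, 3, 4],
                      [5, 6, 7, 8]])
print(tf.gather_nd(params, [[0, 1], [1, 2]]).numpy())
# [2 7]  (params[0, 1] and params[1, 2])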
This solution worked for me; it modifies tornikeo's answer by wrapping tf.gather in a .map().
dataset = tf.data.Dataset.from_tensor_slices([[1, 2, 3, 4],
                                              [5, 6, 7, 8]])
# Keep only columns 0 and 2 of each element.
dataset_filter = dataset.map(lambda x: tf.gather(x, [0, 2], axis=0))
result = list(dataset_filter.as_numpy_iterator())
print(result)
# Outputs: [array([1, 3], dtype=int32), array([5, 7], dtype=int32)]
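Building on that, one way to get the three column subsets the question asks for from a single preprocessed dataset is to return a tuple from the same .map call. This is only a sketch; the column index lists are hypothetical placeholders for your actual output groups.

import tensorflow as tf

# Hypothetical column indices for the three output branches.
branch_cols = [[0, 1], [2], [3]]

dataset = tf.data.Dataset.from_tensor_slices([[1, 2, 3, 4],
                                              [5, 6, 7, 8]])

# One map call that slices each row into the three groups.
split = dataset.map(
    lambda x: tuple(tf.gather(x, cols, axis=0) for cols in branch_cols))

for a, b, c in split.as_numpy_iterator():
    print(a, b, c)
# [1 2] [3] [4]
# [5 6] [7] [8]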