
How to load batches of CSV files using tf.data and map

I have been searching for an answer as to how I should go about this for quite some time and can't seem to find anything that works.

I am following a tutorial on using the tf.data API found here. My scenario is very similar to the one in this tutorial (i.e. I have 3 directories containing all the training/validation/test files), however, they are not images, they're spectrograms saved as CSVs.

I have found a couple of solutions for reading lines of a CSV where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?). My issue with that approach is the required record_defaults parameter, since each of my CSVs is a full 500x200 instance rather than one row per example.
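For illustration, a line-based reader needs one default value per column; something like this sketch (using tf.io.decode_csv; the names are just placeholders) is roughly what I would have to write, and it still only yields one row at a time:

import tensorflow as tf

# One default value per column, already unwieldy at 200 columns
record_defaults = [[0.0]] * 200

def parse_line(line):
    # Splits a single CSV line into 200 scalar tensors...
    fields = tf.io.decode_csv(line, record_defaults)
    # ...and stacks them into one row of shape (200,)
    return tf.stack(fields)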

Here is what I was thinking:

import tensorflow as tf
import pandas as pd

def load_data(path, label):
    # This obviously doesn't work because path and label
    # are Tensors, but this is what I had in mind...
    data = pd.read_csv(path, index_col=0).values
    return data, label

X_train = tf.constant(training_files)  # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file

train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))

# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)

iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)

I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.

Any thoughts? Thanks.

asked Sep 18 '25 by markdjthomas


1 Answer

I use TensorFlow 2.0's tf.data to read my CSV dataset. I have one folder per class, and each folder contains thousands of CSV files of data points. Below is the code I use for the data input pipeline. Hope this helps.

import numpy as np
import tensorflow as tf

def tf_parse_filename(filename):

    def parse_filename(filename_batch):
        data = []
        labels = []
        for filename in filename_batch:
            # Decode the filename tensor into a Python string
            filename_str = filename.numpy().decode()
            # Load the CSV as float32 (np.loadtxt defaults to float64,
            # which would not match the Tout declared for py_function)
            data_point = np.loadtxt(filename_str, delimiter=',', dtype=np.float32)

            # Create a one-hot label (get_label and n_classes are
            # defined elsewhere in my code)
            current_label = get_label(filename_str)
            label = np.zeros(n_classes, dtype=np.float32)
            label[current_label] = 1.0

            data.append(data_point)
            labels.append(label)

        return np.stack(data), np.stack(labels)

    x, y = tf.py_function(parse_filename, [filename], [tf.float32, tf.float32])
    return x, y

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in newer TF versions

train_ds = tf.data.Dataset.from_tensor_slices(TRAIN_FILES)
train_ds = train_ds.batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(tf_parse_filename, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# Train over epochs (num_epochs, BATCH_SIZE and train_step are
# defined elsewhere in my code)
for i in range(num_epochs):
    # Train on batches
    for x_train, y_train in train_ds:
        train_step(x_train, y_train)

print('Training done!')

"TRAIN_FILES" is a matrix (e.g. pandas dataframe) where the first column is the label of a data point and the second column is the path to the csv file containing the data point.

answered Sep 19 '25 by yasin_alm