
How to load batches of CSV files using tf.data and map

I have been searching for an answer as to how I should go about this for quite some time and can't seem to find anything that works.

I am following a tutorial on using the tf.data API found here. My scenario is very similar to the one in this tutorial (i.e. I have 3 directories containing all the training/validation/test files), however, they are not images, they're spectrograms saved as CSVs.

I have found a couple of solutions for reading lines of a CSV where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?). My issue with that approach is the required record_defaults parameter, since each of my CSVs is a full 500x200 instance rather than one row per example.
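For illustration, a line-based reader needs one default value per column; something like this sketch (using tf.io.decode_csv; the names are just placeholders) is roughly what I would have to write, and it still only yields one row at a time:

import tensorflow as tf

# One default value per column, already unwieldy at 200 columns
record_defaults = [[0.0]] * 200

def parse_line(line):
    # Splits a single CSV line into 200 scalar tensors...
    fields = tf.io.decode_csv(line, record_defaults)
    # ...and stacks them into one row of shape (200,)
    return tf.stack(fields)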

Here is what I was thinking:

import tensorflow as tf
import pandas as pd

def load_data(path, label):
    # This obviously doesn't work because path and label
    # are Tensors, but this is what I had in mind...
    data = pd.read_csv(path, index_col=0).values
    return data, label

X_train = tf.constant(training_files)  # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file

train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))

# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)

iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)

I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.

Any thoughts? Thanks.

asked Sep 18 '25 by markdjthomas


1 Answer

I use TensorFlow 2.0's tf.data to read my CSV dataset. I have one folder per class, and each folder contains thousands of CSV files of data points. Below is the code I use for the data input pipeline. Hope this helps.

import numpy as np
import tensorflow as tf

def tf_parse_filename(filename):

    def parse_filename(filename_batch):
        data = []
        labels = []
        for filename in filename_batch:
            # Decode the filename tensor into a Python string
            filename_str = filename.numpy().decode()
            # Load the CSV as float32 (np.loadtxt defaults to float64,
            # which would not match the Tout declared for py_function)
            data_point = np.loadtxt(filename_str, delimiter=',', dtype=np.float32)

            # Create a one-hot label (get_label and n_classes are
            # defined elsewhere in my code)
            current_label = get_label(filename_str)
            label = np.zeros(n_classes, dtype=np.float32)
            label[current_label] = 1.0

            data.append(data_point)
            labels.append(label)

        return np.stack(data), np.stack(labels)

    x, y = tf.py_function(parse_filename, [filename], [tf.float32, tf.float32])
    return x, y

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in newer TF versions

train_ds = tf.data.Dataset.from_tensor_slices(TRAIN_FILES)
train_ds = train_ds.batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(tf_parse_filename, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# Train over epochs (num_epochs, BATCH_SIZE and train_step are
# defined elsewhere in my code)
for i in range(num_epochs):
    # Train on batches
    for x_train, y_train in train_ds:
        train_step(x_train, y_train)

print('Training done!')

"TRAIN_FILES" is a matrix (e.g. pandas dataframe) where the first column is the label of a data point and the second column is the path to the csv file containing the data point.

answered Sep 19 '25 by yasin_alm