
When to use tensorflow datasets api versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.

In my situation I have a file data.csv with my features, and would like to do the following two tasks:

  1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,

    labels[i] = features[i + h, -1] / features[i, -1] - 1

    I would like h to be a parameter here, so I can experiment with different horizons.

  2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:

    train_features[i] = features[i: i + window]

I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.

Edit: I'd also like to know whether the two tasks I listed are suited to the Dataset API, or whether I'm better off using other libraries to deal with them.

Asked by ira on Jan 14 '18 at 03:01



1 Answer

First off, note that you can use the Dataset API with pandas or numpy arrays, as described in the tutorial:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices()
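As a minimal sketch of that in-memory route, the two tasks from the question can be done in numpy first and the results handed to `Dataset.from_tensor_slices()`. This assumes TensorFlow 2.x with eager execution. The helper name `make_windows_and_labels`, the toy array standing in for `data.csv`, and the alignment between windows and labels (the label is measured from the last row of each window) are illustrative assumptions, not something the question specifies.

```python
import numpy as np
import tensorflow as tf


def make_windows_and_labels(features, window, h):
    """Build rolling windows and percent-change labels with NumPy.

    Assumed alignment (hypothetical): window i covers rows [i, i + window)
    and its label is the percent change of the last column from the
    window's final row to the row h steps later.
    """
    n = len(features) - window - h + 1  # number of complete examples
    if n <= 0:
        raise ValueError("not enough rows for this window and horizon")
    windows = np.stack([features[i:i + window] for i in range(n)])
    last = np.arange(n) + window - 1  # index of each window's final row
    labels = features[last + h, -1] / features[last, -1] - 1.0
    return windows.astype(np.float32), labels.astype(np.float32)


# Toy stand-in for the contents of data.csv: 10 rows, 2 feature columns.
features = np.arange(1.0, 21.0).reshape(10, 2)
windows, labels = make_windows_and_labels(features, window=3, h=2)

# Wrap the in-memory arrays in a Dataset, then shuffle and batch.
dataset = (
    tf.data.Dataset.from_tensor_slices((windows, labels))
    .shuffle(buffer_size=len(windows))
    .batch(4)
)
```

Because `window` and `h` are plain function arguments here, trying a different horizon only means calling the helper again and rebuilding the dataset.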

A more interesting question is whether you should organize the data pipeline with the session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":

While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:

# feed_dict often results in suboptimal performance when using large inputs  
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, it makes no difference; use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan distributed computation.

The Dataset API works nicely with text data such as CSV files; check out this section of the dataset tutorial.
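For the CSV case, here is a hedged sketch of how the rolling windows and horizon labels from the question could be built entirely inside a tf.data pipeline. It assumes TensorFlow 2.x, a purely numeric file with a header row, and that the label is measured from the last row of each window. The temporary file, the column names `a` and `price`, and the helper names `parse_line` and `split_span` are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write a tiny stand-in for data.csv (hypothetical columns "a" and "price").
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(csv_path, "w") as f:
    f.write("a,price\n")
    for i in range(10):
        f.write(f"{i}.0,{100.0 + i}\n")

window, h = 3, 2  # window length and label horizon, both free parameters


def parse_line(line):
    # Decode one CSV line into a float32 vector with one entry per column.
    fields = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0]])
    return tf.stack(fields)


# Skip the header row, then parse each remaining line.
rows = tf.data.TextLineDataset(csv_path).skip(1).map(parse_line)

# Group every run of (window + h) consecutive rows into one tensor.
spans = rows.window(window + h, shift=1, drop_remainder=True).flat_map(
    lambda w: w.batch(window + h, drop_remainder=True)
)


def split_span(span):
    # Features: the first `window` rows of the span.
    x = span[:window]
    # Label: percent change of the last column, h rows past the window's end.
    y = span[-1, -1] / span[window - 1, -1] - 1.0
    return x, y


dataset = spans.map(split_span).batch(4)
```

The `window(...).flat_map(...)` pair is a common idiom for sliding windows in tf.data. Whether it beats doing the same work in pandas or numpy up front depends mostly on whether the file fits in memory.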

Answered by Maxim on Oct 12 '22 at 07:10
