How to handle large amounts of data in TensorFlow?

Tags: python, machine-learning, numpy, tensorflow, bigdata

For my project I have a large amount of data: about 60GB spread across npy files of about 1GB each, each containing about 750k records and labels.

Each record is a vector of 345 float32 values and each label is a vector of 5 float32 values.
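
As a quick sanity check on those sizes, the arithmetic below assumes each record and its label are stored as raw float32 values (4 bytes each) and ignores any file header overhead. The variable names are only illustrative.

```python
# Back-of-the-envelope size check; assumes raw float32 storage, no header overhead.
RECORDS_PER_FILE = 750_000   # "about 750k" records per file
FEATURES = 345               # float32 values per record
LABELS = 5                   # float32 values per label
BYTES_PER_FLOAT32 = 4

bytes_per_row = (FEATURES + LABELS) * BYTES_PER_FLOAT32
bytes_per_file = RECORDS_PER_FILE * bytes_per_row

print(bytes_per_row)          # 1400 bytes per record plus label
print(bytes_per_file / 1e9)   # 1.05, i.e. about 1GB per file
```

So the stated numbers are consistent with each other: roughly 1GB per file, which is too large to comfortably hold many files in memory at once.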

I read the TensorFlow dataset documentation and the queues/threads documentation as well, but I can't figure out how best to handle the input for training, and then how to save the model and weights for future prediction.

My model is pretty straightforward; it looks like this:

import tensorflow as tf

# weight_and_bias(n_in, n_out) is a helper whose definition is not shown here
x = tf.placeholder(tf.float32, [None, 345], name='x')
y = tf.placeholder(tf.float32, [None, 5], name='y')
wi, bi = weight_and_bias(345, 2048)
hidden_fc = tf.nn.sigmoid(tf.matmul(x, wi) + bi)
wo, bo = weight_and_bias(2048, 5)
out_fc = tf.nn.sigmoid(tf.matmul(hidden_fc, wo) + bo)
loss = tf.reduce_mean(tf.squared_difference(y, out_fc))
train_op = tf.train.AdamOptimizer().minimize(loss)

I have been training the network by reading the files one at a time in random order, using a shuffled numpy index array to index each file, and manually building each batch to feed train_op through feed_dict. From everything I have read, this is very inefficient and I should replace it with datasets or with queues and threads, but, as I said, the documentation was of no help.
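
For reference, the manual approach described above might look roughly like the sketch below. The helper name `iterate_batches` and the batch size are hypothetical, and small in-memory arrays stand in for the contents of one loaded file.

```python
import numpy as np

def iterate_batches(data, labels, batch_size, rng):
    """Yield shuffled (data, labels) batches from one file's arrays."""
    order = rng.permutation(len(data))           # shuffled row indices
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], labels[idx]

rng = np.random.RandomState(0)
data = np.zeros((1000, 345), dtype=np.float32)   # stand-in for one file's records
labels = np.zeros((1000, 5), dtype=np.float32)   # stand-in for one file's labels

batches = list(iterate_batches(data, labels, batch_size=128, rng=rng))
print(len(batches))   # 8 batches: 7 full ones and a final batch of 104 rows
```

Each yielded pair would then be passed to the session through feed_dict, which means every batch is copied from Python into the graph on every step.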

So, what is the best way to handle large amounts of data in tensorflow?

Also, for reference, my data was saved to each numpy file in two steps:

import numpy as np

# np.save takes the file first, then the array
with open('datafile1.npy', 'wb') as fp:
    np.save(fp, data)
    np.save(fp, labels)
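
Because both arrays are written back to back into one file, they have to be read back in the same order, with two np.load calls on the same open file handle. A small self-contained sketch, using tiny random arrays and a temporary file as stand-ins for the real data:

```python
import os
import tempfile

import numpy as np

# Tiny stand-ins for the real arrays (the real ones have about 750k rows).
data = np.random.rand(10, 345).astype(np.float32)
labels = np.random.rand(10, 5).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), 'datafile1.npy')

# Write both arrays sequentially into one file (file argument first, then array).
with open(path, 'wb') as fp:
    np.save(fp, data)
    np.save(fp, labels)

# Read them back in the same order they were written.
with open(path, 'rb') as fp:
    loaded_data = np.load(fp)
    loaded_labels = np.load(fp)

print(loaded_data.shape, loaded_labels.shape)   # (10, 345) (10, 5)
```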

asked Oct 18 '17 23:10

Joao Paulo Farias

1 Answer

The utilities for npy files do indeed load the whole array into memory. I'd recommend converting all of your numpy arrays to the TFRecords format and using those files for training. This is one of the most efficient ways to read a large dataset in TensorFlow.

Convert to TFRecords

import tensorflow as tf

def array_to_tfrecords(X, y, output_file):
  # Write one Example per record, so that each one matches the
  # (345,) and (5,) shapes expected when the file is parsed.
  writer = tf.python_io.TFRecordWriter(output_file)
  for record, label in zip(X, y):
    feature = {
      'X': tf.train.Feature(float_list=tf.train.FloatList(value=record.flatten())),
      'y': tf.train.Feature(float_list=tf.train.FloatList(value=label.flatten()))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())
  writer.close()

A complete example that deals with images can be found here.

Read TFRecordDataset

def parse_proto(example_proto):
  features = {
    'X': tf.FixedLenFeature((345,), tf.float32),
    'y': tf.FixedLenFeature((5,), tf.float32),
  }
  parsed_features = tf.parse_single_example(example_proto, features)
  return parsed_features['X'], parsed_features['y']

def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
  dataset = tf.contrib.data.TFRecordDataset(file_names)
  dataset = dataset.map(parse_proto)
  dataset = dataset.shuffle(buffer_size)
  dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  # A one-shot iterator is bound to this dataset and needs no explicit initialization.
  return dataset.make_one_shot_iterator()

The data manual can be found here.
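
The buffer_size argument controls how well the data is mixed: shuffle keeps a buffer of that many elements and draws each output at random from it, so a buffer much smaller than the dataset only shuffles locally. The pure-Python sketch below imitates that behaviour on a list of integers. It illustrates the idea only and is not TensorFlow's implementation; the function name `buffered_shuffle` is made up for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # Buffer is full: emit one randomly chosen element to make room.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

records = list(range(20))
small = list(buffered_shuffle(records, buffer_size=3))
full = list(buffered_shuffle(records, buffer_size=len(records)))

# Every record is emitted exactly once either way.
assert sorted(small) == records and sorted(full) == records
# With a small buffer, the element at output position i comes from
# an input position less than i + buffer_size.
assert all(value < i + 3 for i, value in enumerate(small))
```

With files of about 750k records each, a buffer of 10000 mixes records only within a small window, so it may also be worth shuffling the order of the input files.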

answered Oct 17 '22 06:10

Maxim
