
Optimal size of a TFRecord file

From your experience, what would be an ideal size for a .tfrecord file that works well across a wide variety of storage devices (hard disk, SSD, NVMe) and storage locations (local machine, HPC cluster with network mounts)?

If I get slower performance on a technically more powerful machine in the cloud than on my local PC, could the size of the TFRecord files be the root cause of the bottleneck?

Thanks

Asked by George on Sep 05 '18


People also ask

Should I use TFRecord?

Using the TFRecord format has several advantages. Efficiency: data in the TFRecord format can take up less space than the original data. Fast I/O: TensorFlow can read TFRecord data with parallel I/O operations, which is very useful when working with GPU or TPU devices.

What is a TFRecord file?

TFRecord is a binary format for efficiently encoding long sequences of tf.Example protos. TFRecord files are easily loaded by TensorFlow through the tf.data package.

How do you make TFRecord?

Once we have created an Example for an image, we need to write it into a TFRecord file. This can be done with a TFRecord writer. In the sketch below, tfrecord_file_name is the name of the TFRecord file in which we want to store the images; TensorFlow will create the file automatically.
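
A minimal sketch of that workflow, assuming already-encoded image files on disk; the names (tfrecord_file_name, image_paths, labels) are illustrative, not from any original code:

```python
import tensorflow as tf

def _bytes_feature(value):
    # Wrap raw bytes in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    # Wrap an integer label in a tf.train.Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

tfrecord_file_name = "images.tfrecord"      # output file; TensorFlow creates it
image_paths = ["img_0.jpg", "img_1.jpg"]    # hypothetical input images
labels = [0, 1]                             # hypothetical labels

with tf.io.TFRecordWriter(tfrecord_file_name) as writer:
    for path, label in zip(image_paths, labels):
        raw = open(path, "rb").read()       # already-encoded image bytes
        example = tf.train.Example(features=tf.train.Features(feature={
            "image_raw": _bytes_feature(raw),
            "label": _int64_feature(label),
        }))
        writer.write(example.SerializeToString())
```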


2 Answers

The official TensorFlow performance guide recommends ~100 MB (https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/):

Reading large numbers of small files significantly impacts I/O performance. One approach to get maximum I/O throughput is to preprocess input data into larger (~100MB) TFRecord files. For smaller data sets (200MB-1GB), the best approach is often to load the entire data set into memory.
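
As a rough sketch of how that advice translates into a tf.data input pipeline, assuming ~100 MB shards and a made-up file pattern and feature spec (none of these names come from the answer); the .cache() call only makes sense for the small-dataset case that fits in memory:

```python
import tensorflow as tf

# Assumed shard naming; adjust to your own files.
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

# Read several ~100 MB shards concurrently instead of one file at a time.
dataset = files.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False)

# Assumed feature layout matching the writing sketch above.
feature_spec = {
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    # Decode one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(record, feature_spec)

dataset = (dataset
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .cache()      # only worthwhile if the whole dataset fits in RAM
           .shuffle(10_000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```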

Answered by krobot on Nov 10 '22


As of 19-09-2020, Google recommends the following rule of thumb:

"In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching. For example, say you have X GBs of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10N files, as long as ~X/(10N) is 10+ MBs (and ideally 100+ MBs). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits."

Source: https://www.tensorflow.org/tutorials/load_data/tfrecord
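
A small back-of-the-envelope sketch of that rule of thumb; the dataset size and host count below are made-up numbers, purely for illustration:

```python
# Made-up numbers: X GB of training data read by N hosts.
X_GB = 50      # total dataset size in GB (assumed)
N_HOSTS = 8    # number of hosts that will read the data (assumed)

target_shards = 10 * N_HOSTS                  # rule of thumb: ~10 files per host -> 80
shard_size_mb = X_GB * 1024 / target_shards   # 51200 MB / 80 shards = 640 MB per shard

# 640 MB per shard is comfortably above the 100 MB ideal, so 80 shards is fine.
# If shards would fall below ~10 MB, cap the shard count instead:
MIN_SHARD_MB = 10
max_shards = int(X_GB * 1024 // MIN_SHARD_MB)
num_shards = min(target_shards, max_shards)

print(num_shards, round(shard_size_mb), "MB per shard")
```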

Answered by MrMuretto on Nov 10 '22