
Optimal size of a TFRecord file

From your experience, what would be an ideal size for a .tfrecord file that works well across a wide variety of storage devices (hard disk, SSD, NVMe) and storage locations (local machine, HPC cluster with network mounts)?

If I get slower performance on a technically more powerful machine in the cloud than on my local PC, could the size of the TFRecord files be the root cause of the bottleneck?

Thanks

Asked by George on Sep 05 '18


People also ask

Should I use TFRecord?

Using the TFRecord format has several advantages. Efficiency: data in the TFRecord format can take up less space than the original data. Fast I/O: TensorFlow can read TFRecord data with parallel I/O operations, which is very useful when working with GPU or TPU devices.

What is a TFRecord file?

TFRecord is a binary format for efficiently encoding long sequences of tf.Example protos. TFRecord files are easily loaded by TensorFlow through the tf.data package.

How do you make TFRecord?

Once we have created an Example for an image, we need to write it into a TFRecord file. This can be done with a TFRecord writer. In the sketch below, tfrecord_file_name is the name of the TFRecord file in which we want to store the images; TensorFlow will create the file automatically.
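
A minimal sketch of that workflow, assuming already-encoded image files on disk; the names (tfrecord_file_name, image_paths, labels) are illustrative, not from any original code:

```python
import tensorflow as tf

def _bytes_feature(value):
    # Wrap raw bytes in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    # Wrap an integer label in a tf.train.Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

tfrecord_file_name = "images.tfrecord"      # output file; TensorFlow creates it
image_paths = ["img_0.jpg", "img_1.jpg"]    # hypothetical input images
labels = [0, 1]                             # hypothetical labels

with tf.io.TFRecordWriter(tfrecord_file_name) as writer:
    for path, label in zip(image_paths, labels):
        raw = open(path, "rb").read()       # already-encoded image bytes
        example = tf.train.Example(features=tf.train.Features(feature={
            "image_raw": _bytes_feature(raw),
            "label": _int64_feature(label),
        }))
        writer.write(example.SerializeToString())
```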


2 Answers

The official TensorFlow performance guide recommends ~100 MB (https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/):

Reading large numbers of small files significantly impacts I/O performance. One approach to get maximum I/O throughput is to preprocess input data into larger (~100MB) TFRecord files. For smaller data sets (200MB-1GB), the best approach is often to load the entire data set into memory.
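
As a rough sketch of how that advice translates into a tf.data input pipeline, assuming ~100 MB shards and a made-up file pattern and feature spec (none of these names come from the answer); the .cache() call only makes sense for the small-dataset case that fits in memory:

```python
import tensorflow as tf

# Assumed shard naming; adjust to your own files.
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

# Read several ~100 MB shards concurrently instead of one file at a time.
dataset = files.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False)

# Assumed feature layout matching the writing sketch above.
feature_spec = {
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    # Decode one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(record, feature_spec)

dataset = (dataset
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .cache()      # only worthwhile if the whole dataset fits in RAM
           .shuffle(10_000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```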

Answered by krobot on Nov 10 '22


As of 19-09-2020, Google recommends the following rule of thumb:

"In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching. For example, say you have X GBs of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10N files, as long as ~X/(10N) is 10+ MBs (and ideally 100+ MBs). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits."

Source: https://www.tensorflow.org/tutorials/load_data/tfrecord
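
A small back-of-the-envelope sketch of that rule of thumb; the dataset size and host count below are made-up numbers, purely for illustration:

```python
# Made-up numbers: X GB of training data read by N hosts.
X_GB = 50      # total dataset size in GB (assumed)
N_HOSTS = 8    # number of hosts that will read the data (assumed)

target_shards = 10 * N_HOSTS                  # rule of thumb: ~10 files per host -> 80
shard_size_mb = X_GB * 1024 / target_shards   # 51200 MB / 80 shards = 640 MB per shard

# 640 MB per shard is comfortably above the 100 MB ideal, so 80 shards is fine.
# If shards would fall below ~10 MB, cap the shard count instead:
MIN_SHARD_MB = 10
max_shards = int(X_GB * 1024 // MIN_SHARD_MB)
num_shards = min(target_shards, max_shards)

print(num_shards, round(shard_size_mb), "MB per shard")
```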

Answered by MrMuretto on Nov 10 '22