From your experience, what would be an ideal size for a .tfrecord file that works well across a wide variety of devices (hard disk, SSD, NVMe) and storage locations (local machine, HPC cluster with network mounts)?
If I get slower performance on a technically more powerful computer in the cloud than on my local PC, could the size of the TFRecord dataset be the root cause of the bottleneck?
Thanks
Using the TFRecord format has several advantages:
- Efficiency: data in the TFRecord format can take up less space than the original data.
- Fast I/O: TensorFlow can read data in the TFRecord format with parallel I/O operations, which is especially useful when you are working with GPU or TPU devices.
TFRecord is a binary format for efficiently encoding long sequences of tf.Example protos. TFRecord files are easily loaded by TensorFlow through the tf.data package, as described in the tf.data documentation.
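A minimal loading sketch; the file name images.tfrecord and the feature keys image_raw and label are illustrative assumptions, not something defined in this answer:

import tensorflow as tf

feature_spec = {
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Deserialize one tf.Example proto into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_spec)

dataset = tf.data.TFRecordDataset("images.tfrecord").map(parse_example)

for record in dataset.take(1):
    print(record["label"].numpy())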
Once we have created an Example for an image, we need to write it into a TFRecord file. This can be done using a TFRecord writer. tfrecord_file_name in the code below is the name of the TFRecord file in which we want to store the images; TensorFlow will create the file automatically.
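The original answer's code is not reproduced above, so the following is a sketch of what such writer code typically looks like; the image paths, labels, and feature keys are placeholder assumptions:

import tensorflow as tf

tfrecord_file_name = "images.tfrecord"   # target TFRecord file; created if it does not exist
image_paths = ["img_0.jpg", "img_1.jpg"]  # illustrative image file names

def image_example(image_path, label):
    # Pack the raw image bytes and an integer label into a tf.train.Example proto.
    image_bytes = tf.io.read_file(image_path).numpy()
    feature = {
        "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter(tfrecord_file_name) as writer:
    for label, path in enumerate(image_paths):
        writer.write(image_example(path, label).SerializeToString())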
The official TensorFlow performance guide recommends ~100MB files (https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/):
Reading large numbers of small files significantly impacts I/O performance. One approach to get maximum I/O throughput is to preprocess input data into larger (~100MB) TFRecord files. For smaller data sets (200MB-1GB), the best approach is often to load the entire data set into memory.
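As an illustration of both recommendations, here is a sketch of a pipeline that reads larger TFRecord shards in parallel and caches a small data set in memory after the first epoch; the glob pattern and tuning numbers are assumptions:

import tensorflow as tf

files = tf.io.gfile.glob("data/train-*.tfrecord")  # assumed shard naming scheme

# Read several ~100MB shards in parallel to keep the input pipeline fed.
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=8)

# For small data sets (roughly 200MB-1GB), cache() keeps everything in RAM
# after the first pass, so later epochs skip disk I/O entirely.
dataset = dataset.cache().shuffle(10_000).batch(32).prefetch(tf.data.AUTOTUNE)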
Currently (19-09-2020) Google recommends the following rule of thumb:
"In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching. For example, say you have X GBs of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10N files, as long as ~X/(10N) is 10+ MBs (and ideally 100+ MBs). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits."
Source: https://www.tensorflow.org/tutorials/load_data/tfrecord
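A sketch of how that rule of thumb might look in practice; the number of hosts, the shard naming scheme, and the write_shards helper are assumptions for illustration, not part of the TensorFlow guidance quoted above:

import tensorflow as tf

num_hosts = 4                 # N hosts that will read during training (assumed)
num_shards = 10 * num_hosts   # rule of thumb: ~10N shards, each ideally 100MB+

def write_shards(serialized_examples, prefix="data/train"):
    # Spread already-serialized tf.Example protos round-robin across the shards.
    writers = [
        tf.io.TFRecordWriter(f"{prefix}-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    for i, example in enumerate(serialized_examples):
        writers[i % num_shards].write(example)
    for w in writers:
        w.close()

# Reading: interleave the shards so several files are read in parallel.
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,                      # number of shards read concurrently
    num_parallel_calls=tf.data.AUTOTUNE,
)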