I'm working with huge satellite images that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple workers. [Settings: Python, Ubuntu 18.04]
I can't find a clear answer on which is best in terms of data access and storage: keeping all the tiles in one single (large) HDF5 file, or saving each tile as a separate file?
In the first case, is there any problem with multiple workers accessing the same file concurrently? And in the second case, is there a penalty for having that many files?
If you go for a single HDF5 file and it comes out bloated, that is probably due to your chunk layout: the smaller the chunks, the more overhead they add to the HDF5 file. Try to find an optimal balance between chunk sizes that fit your access pattern and the size overhead they introduce into the file.
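As a minimal sketch with h5py (the file name tiles.h5, the dataset name, the tile count, and the 256x256x3 tile shape are all assumptions for illustration), setting the chunk shape to exactly one tile means each tile read touches a single chunk, while avoiding the overhead of many tiny chunks:

    import h5py
    import numpy as np

    # Hypothetical layout: 100 RGB tiles of 256x256 pixels.
    tiles = np.random.rand(100, 256, 256, 3).astype("float32")

    with h5py.File("tiles.h5", "w") as f:
        f.create_dataset(
            "tiles",
            data=tiles,
            chunks=(1, 256, 256, 3),  # one chunk per tile
            compression="gzip",
        )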
Some background on the format: the development of HDF5 was motivated by a number of limitations of the older HDF format and library, among them that a single file could not store more than 20,000 complex objects or be larger than 2 gigabytes.
Supports large, complex data: HDF5 is a compressed format designed to support large, heterogeneous, and complex datasets. Supports data slicing: "data slicing", or extracting portions of the dataset as needed for analysis, means large files don't need to be read completely into the computer's memory (RAM).
What is an H5 file? An H5 file is one of the Hierarchical Data Format (HDF) file types, used to store large amounts of data in the form of multidimensional arrays. The format is primarily used for scientific data that is well organized for quick retrieval and analysis.
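For instance, h5py lets you read a single tile (or part of one) straight from disk without touching the rest of the dataset; the tiles.h5 file and "tiles" dataset below are the same illustrative assumptions as above:

    import h5py

    with h5py.File("tiles.h5", "r") as f:
        dset = f["tiles"]             # no data is read yet
        tile = dset[42]               # reads only tile 42 from disk
        patch = dset[0, :64, :64, :]  # partial read of a single tile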
I would go for multiple files if I were you (but read till the end).
Intuitively, you could load at least some of the files into memory, speeding up the process a little (it is unlikely you would be able to do so with 20GB; if you are, then you definitely should, as RAM access is much faster than disk).
You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve the cached examples (stored in a list, or preferably a more memory-efficient data structure with better cache locality) instead of reading from disk. This is similar to the approach of TensorFlow's tf.data.Dataset object and its cache method.
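A minimal sketch of that caching idea, assuming each tile is stored as its own .npy file (a hypothetical layout, one file per tile):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class CachedTileDataset(Dataset):
        """Reads tiles from disk on first access, then serves them from memory."""

        def __init__(self, paths):
            self.paths = paths               # list of .npy tile files (assumed layout)
            self.cache = [None] * len(paths)

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            if self.cache[idx] is None:
                # First pass: load from disk and keep the tensor around.
                self.cache[idx] = torch.from_numpy(np.load(self.paths[idx]))
            return self.cache[idx]

Note that with num_workers > 0 each DataLoader worker process holds its own copy of the cache, so the cached tiles are not shared between workers.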
On the other hand, this approach is more cumbersome and harder to implement correctly. That said, if you are only reading the file with multiple workers, you should be fine, and there shouldn't be any locks on read-only access.
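For the single-HDF5-file case, a common pattern is to open the file lazily so that each worker ends up with its own read-only handle instead of sharing one across processes; a sketch, again assuming the tiles.h5 layout from above:

    import h5py
    import torch
    from torch.utils.data import Dataset, DataLoader

    class H5TileDataset(Dataset):
        def __init__(self, path):
            self.path = path
            self.file = None                 # opened lazily, once per worker
            with h5py.File(path, "r") as f:  # short-lived handle just to get the length
                self.length = len(f["tiles"])

        def __len__(self):
            return self.length

        def __getitem__(self, idx):
            if self.file is None:
                # Each DataLoader worker opens its own read-only handle here.
                self.file = h5py.File(self.path, "r")
            return torch.from_numpy(self.file["tiles"][idx])

    loader = DataLoader(H5TileDataset("tiles.h5"), batch_size=32, num_workers=4)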
Remember to measure your approach with PyTorch's profiler (torch.utils.bottleneck) to pinpoint the exact problems and verify your solutions.
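It runs from the command line against your training script (train.py here is just a placeholder name) and prints cProfile and autograd profiler summaries, so you can see whether the time goes into data loading or the model itself:

    python -m torch.utils.bottleneck train.py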