 

What is best: multiple small HDF5 files or one huge one?

I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple threads. [Setup: Python, Ubuntu 18.04]

I can't find any answer on which is best in terms of data access and storage:

  1. storing all the data in one huge HDF5 file (over 20 GB)
  2. splitting it into many (over 16,000) small HDF5 files (approx. 1.4 MB each).

Is there any problem with multiple threads accessing one file? And in the other case, is there an impact from having that many files?

NanBlanc asked Jul 04 '19

People also ask

Why are H5 files so large?

This is probably due to your chunk layout - the smaller the chunk sizes, the more your HDF5 file will be bloated. Try to find an optimal balance between chunk sizes (to serve your use case properly) and the overhead (size-wise) that they introduce in the HDF5 file.
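For illustration, here is a minimal h5py sketch (the file and dataset names are made up) that writes the same array with two different chunk layouts; very small chunks typically add noticeable per-chunk overhead to the file:

    import os

    import h5py
    import numpy as np

    data = np.random.rand(1024, 1024).astype("float32")

    # Write the same array twice with different chunk shapes and compare file sizes.
    for path, chunks in [("big_chunks.h5", (256, 256)), ("tiny_chunks.h5", (8, 8))]:
        with h5py.File(path, "w") as f:
            f.create_dataset("grid", data=data, chunks=chunks, compression="gzip")
        print(path, os.path.getsize(path), "bytes")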

How big can HDF5 files be?

The development of HDF5 is motivated by a number of limitations in the older HDF format and library. Some of these limitations are: A single file cannot store more than 20,000 complex objects, and a single file cannot be larger than 2 gigabytes.

When should I use HDF5?

Supports Large, Complex Data: HDF5 is a compressed format that is designed to support large, heterogeneous, and complex datasets. Supports Data Slicing: "data slicing", or extracting portions of the dataset as needed for analysis, means large files don't need to be completely read into the computer's memory (RAM).
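As a rough sketch of what data slicing looks like with h5py (the file and dataset names here are hypothetical):

    import h5py

    # Only the requested window is read from disk; the full dataset never
    # has to fit in RAM.
    with h5py.File("scene.h5", "r") as f:
        window = f["bands"][0, 500:756, 500:756]

    print(window.shape)  # (256, 256)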

What are H5 files?

What is an H5 file? An H5 file is one of the Hierarchical Data Formats (HDF) used to store large amounts of data, typically in the form of multidimensional arrays. The format is primarily used to store scientific data that is well organized for quick retrieval and analysis.


1 Answer

I would go for multiple files if I were you (but read till the end).

Intuitively, you could load at least some files into memory, speeding the process up a little (it is unlikely you would be able to do so with 20 GB; if you can, then you definitely should, as RAM access is much faster).

You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve the cached examples (kept in a list, or preferably a more memory-efficient data structure with better cache locality) instead of reading from disk (a similar approach to TensorFlow's tf.data.Dataset object and its cache method).
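A minimal sketch of that idea, assuming one small HDF5 file per tile with a dataset named "tile" (the names and the plain in-memory list are illustrative only; note that with num_workers > 0 each DataLoader worker process keeps its own copy of the cache):

    import h5py
    import torch
    from torch.utils.data import Dataset


    class CachedTileDataset(Dataset):
        """Reads each tile from disk once, then serves it from an in-memory cache."""

        def __init__(self, tile_paths):
            self.tile_paths = tile_paths
            self._cache = [None] * len(tile_paths)

        def __len__(self):
            return len(self.tile_paths)

        def __getitem__(self, idx):
            if self._cache[idx] is None:
                with h5py.File(self.tile_paths[idx], "r") as f:
                    self._cache[idx] = torch.from_numpy(f["tile"][()])
            return self._cache[idx]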

On the other hand, this approach is more cumbersome and harder to implement correctly. Still, if you are only reading the file with multiple threads, you should be fine, and there shouldn't be any locks on this operation.
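If you go with the single huge file instead, one common pattern (again only a sketch, with hypothetical names) is to open the HDF5 file lazily inside __getitem__, so each DataLoader worker process ends up with its own read-only file handle rather than sharing one:

    import h5py
    import torch
    from torch.utils.data import Dataset


    class SingleFileTileDataset(Dataset):
        """Each worker opens its own read-only handle to the big HDF5 file."""

        def __init__(self, h5_path, dataset_name="tiles"):
            self.h5_path = h5_path
            self.dataset_name = dataset_name
            self._file = None
            with h5py.File(h5_path, "r") as f:   # opened briefly just to read the length
                self._length = f[dataset_name].shape[0]

        def __len__(self):
            return self._length

        def __getitem__(self, idx):
            if self._file is None:               # opened once per worker process
                self._file = h5py.File(self.h5_path, "r")
            return torch.from_numpy(self._file[self.dataset_name][idx])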

Remember to measure your approach with PyTorch's profiler (torch.utils.bottleneck) to pinpoint exact problems and verify solutions.
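torch.utils.bottleneck is run as a module against your training script, for example (the script name and arguments are placeholders):

    python -m torch.utils.bottleneck train.py --epochs 1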

Szymon Maszke answered Sep 19 '22