My application uses a text file to store its data. I was testing for the fastest way of reading it by multithreading the operation. I used the following two techniques:
Method 1: Use as many streams as the NUMBER_OF_PROCESSORS environment variable indicates. Each stream runs on its own thread. Divide the total number of lines in the file equally among the streams, and let each stream parse its share of the text.
Method 2: Only one stream reads the entire file and loads the data into memory. Then create threads (= NUMBER_OF_PROCESSORS - 1) to parse the data from memory.
The test was run on various file sizes from 100 kB to 800 MB. Sample data in the file:
100.23123 -42343.342555 ...(and so on)
4928340 -93240.2 349 ...
...
The data is stored in a 2D array of double.
Result: Both methods take approximately the same time for parsing the file.
Question: Which method should I choose?
Method 1 is bad for the hard disk because multiple read accesses are performed at random locations simultaneously.
Method 2 is bad because the memory required is proportional to the file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content, and filling it again from the reader. But this increases the processing time.
Method 2 has a sequential bottleneck (the single-threaded reading and handing out of the work items). According to Amdahl's Law, this will not scale indefinitely. It is a very fair and reliable method, though.
Method 1 has no bottleneck and will scale. Be sure not to cause random I/O on the disk: I'd use a mutex so that only one thread reads at a time, reading big sequential blocks of maybe 4-16 MB. In the time the disk takes for a single head seek, it could have read about 1 MB of data.
If parsing the lines takes a considerable amount of time, you can't use method 2 because of the big sequential part. It would not scale. If parsing is fast, though, use method 2 because it is easier to get right.
To illustrate the concept of a bottleneck: imagine 1,000,000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to keep up handing out lines as quickly as they are demanded; you would not get 1e6 times the throughput, so this would not scale. But if 1e6 threads read independently from a very fast I/O device, you would get 1e6 times the throughput because there is no bottleneck. (I have used extreme numbers to make the point. The same idea applies in the small.)
I'd prefer a slightly modified Method 2: read the data sequentially in a single thread, in big chunks. Each ready chunk is passed to a thread pool where the data is processed, so you get concurrent reading and processing.