
Is parallel file writing efficient?

I would like to know whether parallel file writing is efficient. After all, a hard disk has only one usable read/write head at a time, so it can only perform one task at a time. But the tests below (in Python) contradict what I expected:

The file to copy is around 1 GB.

Script 1 ( // task: read and write the same file line by line, 10 times, in parallel ):

#!/usr/bin/env python
from multiprocessing import Pool

def read_and_write(copy_filename):
    with open("/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
        with open("/env/cns/bigtmp1/{}.fastq".format(copy_filename), "w") as fout:
            for line in fori:
                fout.write(line)  # `line` already ends with "\n"
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    pool = Pool(processes=4)
    results = pool.map(read_and_write, f_names)

if __name__ == "__main__":
    main()

Script 2 ( serial task: read and write the same file line by line, 10 times ):

#!/usr/bin/env python

def read_and_write(copy_filename):
    with open("/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
        with open("/env/cns/bigtmp1/{}.fastq".format(copy_filename), "w") as fout:
            for line in fori:
                fout.write(line)  # `line` already ends with "\n"
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    for n in f_names:
        result = read_and_write(n)

if __name__ == "__main__":
    main()

Script 3 ( // task: copy the same file 10 times, in parallel ):

#!/usr/bin/env python
from shutil import copyfile
from multiprocessing import Pool

def read_and_write(copy_filename):
    copyfile("/env/cns/bigtmp1/ERR000916_2.fastq", "/env/cns/bigtmp1/{}.fastq".format(copy_filename))
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    pool = Pool(processes=4)
    results = pool.map(read_and_write, f_names)

if __name__ == "__main__":
    main()

Script 4 ( serial task: copy the same file 10 times ):

#!/usr/bin/env python
from shutil import copyfile

def read_and_write(copy_filename):
    copyfile("/env/cns/bigtmp1/ERR000916_2.fastq", "/env/cns/bigtmp1/{}.fastq".format(copy_filename))
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    for n in f_names:
        result = read_and_write(n)

if __name__ == "__main__":
    main()

Results:

$ # // task to read and write line by line 10 times a same file
$ time python read_write_1.py

real    1m46.484s
user    3m40.865s
sys 0m29.455s

$ rm test_jm*
$ # task to read and write line by line 10 times a same file
$ time python read_write_2.py

real    4m16.530s
user    3m41.303s
sys 0m24.032s

$ rm test_jm*
$ # // task to copy 10 times a same file
$ time python read_write_3.py

real    1m35.890s
user    0m10.615s
sys 0m36.361s


$ rm test_jm*
$ # task to copy 10 times a same file
$ time python read_write_4.py

real    1m40.660s
user    0m7.322s
sys 0m25.020s
$ rm test_jm*

These basic results seem to show that parallel (//) read and write I/O is more efficient.

Thanks for your insight.

asked Jan 08 '16 by bioinfornatics


1 Answer

I would like to know if parallel file writing is efficient.

Short answer: physically writing to the same disk from multiple threads at the same time will never be faster than writing to that disk from one thread (talking about normal hard disks here). In some cases it can even be a lot slower.

But, as always, it depends on a lot of factors:

  • OS disk caching: writes are usually kept in cache by the OS, and then written to the disk in chunks. So multiple threads can write to that cache simultaneously without a problem, and have a speed advantage doing so. Especially if the processing / preparing of the data takes longer than the writing speed of the disk.

  • In some cases, even when writing directly to the physical disk from multiple threads, the OS will optimize this and only write large blocks to each file.

  • In the worst case scenario however, smaller blocks could be written to disk each time, resulting in the need for a hard disk seek (± 10ms on a normal hdd!) on every file-switch (doing the same on a SSD wouldn't be so bad because there is more direct access and no seeks are needed).

So, in general, when writing to disk from multiple threads simultaneously, it might be a good idea to prepare (some) data in memory, and write the final data to disk in larger blocks using some kind of lock, or perhaps from one dedicated writer-thread. If the files are growing while being written to (i.e. no file size is set up front), writing the data in larger blocks could also prevent disk fragmentation (at least as much as possible).

On some systems there might be no difference at all, but on others it can make a big difference, and become a lot slower (or even on the same system with different hard disks).

To have a good test of the differences in writing speeds using a single thread vs multiple threads, total file sizes would have to be bigger than the available memory - or at least all buffers should be flushed to disk before measuring the end time. Measuring only the time it takes to write the data to the OS disk cache wouldn't make much sense here.
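A sketch of timing a write so that the OS disk cache doesn't hide the real cost (my own illustration, not from the answer): flush Python's userspace buffer and ask the OS to push the data to the device before taking the end time. File name and size are arbitrary.

```python
import os
import time

data = b"x" * (16 * 1024 * 1024)  # 16 MB of dummy data

start = time.time()
with open("flush_test.bin", "wb") as f:
    f.write(data)
    f.flush()             # flush Python's userspace buffer to the OS
    os.fsync(f.fileno())  # ask the kernel to push the data to the device
elapsed = time.time() - start
# Without flush()/fsync(), `elapsed` would mostly measure a memcpy into
# the OS page cache, not the disk.
```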

Ideally, the total time measured to write all data to disk should equal the physical hard disk writing speed. If writing to disk using one thread is slower than the disk write speed (which means processing of the data takes longer than writing it), obviously using more threads will speed things up. If writing from multiple threads becomes slower than the disk write speed, time will be lost in disk seeks caused by switching between the different files (or different blocks inside the same big file).

To get an idea of the loss in time when performing lots of disk seeks, let's look at some numbers:

Say, we have a hdd with a write speed of 50MB/s:

  • Writing one contiguous block of 50MB would take 1 second (in ideal circumstances).

  • Doing the same in blocks of 1MB, with a file-switch and resulting disk seek in between, would give: 20ms to write 1MB + 10ms seek time. Writing 50MB this way would take 1.5 seconds. That's a 50% increase in time, only to do a quick seek in between (the same holds for reading from disk as well; the difference will be even bigger, considering the faster reading speed).
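The arithmetic above, spelled out as a small check (the 50 MB/s write speed and 10 ms seek are the answer's example numbers):

```python
WRITE_SPEED_MB_S = 50.0  # sequential write speed of the example hdd
SEEK_MS = 10.0           # one disk seek per 1 MB file-switch
BLOCK_MB = 1.0
TOTAL_MB = 50.0

write_ms_per_block = BLOCK_MB / WRITE_SPEED_MB_S * 1000  # 20 ms per 1 MB
blocks = TOTAL_MB / BLOCK_MB                             # 50 blocks
total_s = blocks * (write_ms_per_block + SEEK_MS) / 1000
print(total_s)  # 1.5 seconds instead of 1.0: a 50% increase
```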

In reality it will be somewhere in between, depending on the system.

While we could hope that the OS takes good care of all that (or that using IOCP, for example, would), this isn't always the case.

answered Oct 04 '22 by Danny_ds