
Is parallel file writing efficient?

I would like to know whether parallel file writing is efficient. After all, a hard disk has only one usable read/write head at a time, so it can only perform one task at a time. But the tests below (in Python) contradict what I expected:

The file to copy is around 1 GB.

Script 1 ( // task: read and write the same file line by line, 10 times, in parallel ):

#!/usr/bin/env python
from multiprocessing import Pool

def read_and_write(copy_filename):
    with open("/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
        with open("/env/cns/bigtmp1/{}.fastq".format(copy_filename), "w") as fout:
            for line in fori:
                fout.write(line)  # `line` already ends with "\n"
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    pool = Pool(processes=4)
    results = pool.map(read_and_write, f_names)

if __name__ == "__main__":
    main()

Script 2 ( serial task: read and write the same file line by line, 10 times ):

#!/usr/bin/env python

def read_and_write(copy_filename):
    with open("/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
        with open("/env/cns/bigtmp1/{}.fastq".format(copy_filename), "w") as fout:
            for line in fori:
                fout.write(line)  # `line` already ends with "\n"
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    for n in f_names:
        result = read_and_write(n)

if __name__ == "__main__":
    main()

Script 3 ( // task: copy the same file 10 times, in parallel ):

#!/usr/bin/env python
from shutil import copyfile
from multiprocessing import Pool

def read_and_write(copy_filename):
    copyfile("/env/cns/bigtmp1/ERR000916_2.fastq", "/env/cns/bigtmp1/{}.fastq".format(copy_filename))
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    pool = Pool(processes=4)
    results = pool.map(read_and_write, f_names)

if __name__ == "__main__":
    main()

Script 4 ( serial task: copy the same file 10 times ):

#!/usr/bin/env python
from shutil import copyfile

def read_and_write(copy_filename):
    copyfile("/env/cns/bigtmp1/ERR000916_2.fastq", "/env/cns/bigtmp1/{}.fastq".format(copy_filename))
    return copy_filename

def main():
    f_names = ["test_jm_{}".format(i) for i in range(10)]
    for n in f_names:
        result = read_and_write(n)

if __name__ == "__main__":
    main()

Results:

$ # // task to read and write line by line 10 times a same file
$ time python read_write_1.py

real    1m46.484s
user    3m40.865s
sys 0m29.455s

$ rm test_jm*
$ # task to read and write line by line 10 times a same file
$ time python read_write_2.py

real    4m16.530s
user    3m41.303s
sys 0m24.032s

$ rm test_jm*
$ # // task to copy 10 times a same file
$ time python read_write_3.py

real    1m35.890s
user    0m10.615s
sys 0m36.361s


$ rm test_jm*
$ # task to copy 10 times a same file
$ time python read_write_4.py

real    1m40.660s
user    0m7.322s
sys 0m25.020s
$ rm test_jm*

These basic results seem to show that parallel (//) read and write I/O is more efficient.

Thanks for your insight.

asked Jan 08 '16 by bioinfornatics


1 Answer

I would like to know if parallel file writing is efficient.

Short answer: physically writing to the same disk from multiple threads at the same time will never be faster than writing to that disk from one thread (talking about normal hard disks here). In some cases it can even be a lot slower.

But, as always, it depends on a lot of factors:

  • OS disk caching: writes are usually kept in cache by the OS, and then written to the disk in chunks. So multiple threads can write to that cache simultaneously without a problem, and have a speed advantage doing so. Especially if the processing / preparing of the data takes longer than the writing speed of the disk.

  • In some cases, even when writing directly to the physical disk from multiple threads, the OS will optimize this and only write large blocks to each file.

  • In the worst case scenario however, smaller blocks could be written to disk each time, resulting in the need for a hard disk seek (± 10ms on a normal hdd!) on every file-switch (doing the same on a SSD wouldn't be so bad because there is more direct access and no seeks are needed).

So, in general, when writing to disk from multiple threads simultaneously, it might be a good idea to prepare (some) data in memory, and write the final data to disk in larger blocks using some kind of lock, or perhaps from one dedicated writer-thread. If the files are growing while being written to (i.e. no file size is set up front), writing the data in larger blocks could also prevent disk fragmentation (at least as much as possible).

On some systems there might be no difference at all, but on others it can make a big difference, and become a lot slower (or even on the same system with different hard disks).

To have a good test of the differences in writing speeds using a single thread vs multiple threads, total file sizes would have to be bigger than the available memory - or at least all buffers should be flushed to disk before measuring the end time. Measuring only the time it takes to write the data to the OS disk cache wouldn't make much sense here.
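A sketch of timing a write so that the OS disk cache doesn't hide the real cost (my own illustration, not from the answer): flush Python's userspace buffer and ask the OS to push the data to the device before taking the end time. File name and size are arbitrary.

```python
import os
import time

data = b"x" * (16 * 1024 * 1024)  # 16 MB of dummy data

start = time.time()
with open("flush_test.bin", "wb") as f:
    f.write(data)
    f.flush()             # flush Python's userspace buffer to the OS
    os.fsync(f.fileno())  # ask the kernel to push the data to the device
elapsed = time.time() - start
# Without flush()/fsync(), `elapsed` would mostly measure a memcpy into
# the OS page cache, not the disk.
```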

Ideally, the total time measured to write all data to disk should equal the physical hard disk writing speed. If writing to disk using one thread is slower than the disk write speed (which means processing of the data takes longer than writing it), obviously using more threads will speed things up. If writing from multiple threads becomes slower than the disk write speed, time will be lost in disk seeks caused by switching between the different files (or different blocks inside the same big file).

To get an idea of the loss in time when performing lots of disk seeks, let's look at some numbers:

Say, we have a hdd with a write speed of 50MB/s:

  • Writing one contiguous block of 50MB would take 1 second (in ideal circumstances).

  • Doing the same in blocks of 1MB, with a file-switch and resulting disk seek in between, would give: 20ms to write 1MB + 10ms seek time. Writing 50MB this way would take 1.5 seconds. That's a 50% increase in time, only to do a quick seek in between (the same holds for reading from disk as well; the difference will be even bigger, considering the faster reading speed).
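The arithmetic above, spelled out as a small check (the 50 MB/s write speed and 10 ms seek are the answer's example numbers):

```python
WRITE_SPEED_MB_S = 50.0  # sequential write speed of the example hdd
SEEK_MS = 10.0           # one disk seek per 1 MB file-switch
BLOCK_MB = 1.0
TOTAL_MB = 50.0

write_ms_per_block = BLOCK_MB / WRITE_SPEED_MB_S * 1000  # 20 ms per 1 MB
blocks = TOTAL_MB / BLOCK_MB                             # 50 blocks
total_s = blocks * (write_ms_per_block + SEEK_MS) / 1000
print(total_s)  # 1.5 seconds instead of 1.0: a 50% increase
```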

In reality it will be somewhere in between, depending on the system.

While we could hope that the OS takes good care of all that (or that using IOCP, for example, would), this isn't always the case.

answered Oct 04 '22 by Danny_ds