
GNU parallel with rsync

I'm trying to run several rsync instances in parallel over ssh using GNU parallel. The command I'm running looks like this:

find /tmp/tempfolder -type f -name 'chunck.*' | sort | parallel --gnu -j 4 -v ssh -i access.pem user@server echo {}\; rsync -Havessh -auz -0 --files-from={} ./ user@server:/destination/path

/tmp/tempfolder contains files with the prefix chunck; each one holds a list of the files to transfer.

With this command I do get the 4 rsync calls, but they take a while to start, they don't all start together, and they don't seem to run in parallel.

What am I doing wrong?

Daivid asked Mar 26 '14 21:03




1 Answer

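Are you sure the rsyncs are really not running in parallel?
Running ps aux | grep rsync from a second terminal while the command is running will show which rsyncs, and how many, are actually running simultaneously. (Plain ps with no options only lists processes attached to the current terminal.)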
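To watch this over a few seconds rather than taking a single snapshot, something like the following could be run in a second terminal. It is only a rough sketch: the helper name count_rsync is made up for illustration, and it assumes a Linux-style ps where "-o comm=" prints the bare command name. The count is a hint, not an exact number of transfers.

```shell
# Count processes whose command name is exactly "rsync".
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_rsync() {
    ps -e -o comm= | grep -c -x 'rsync' || true
}

# Sample three times, one second apart (run this in a second terminal
# while the parallel command is working).
for i in 1 2 3; do
    printf '%s  rsync processes: %s\n' "$(date +%T)" "$(count_rsync)"
    sleep 1
done
```

If the count stays above 1 while the transfer is going, more than one rsync is alive at the same time.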

By default, parallel holds back the output of each job until that job has finished, so that the output of different commands doesn't get mixed together:

--group  Group output. Output from each jobs is grouped together and is only printed when the command
         is finished. stderr (standard error) first followed by stdout (standard output). This takes
         some CPU time. In rare situations GNU parallel takes up lots of CPU time and if it is
         acceptable that the outputs from different commands are mixed together, then disabling
         grouping with -u can speedup GNU parallel by a factor of 10.

         --group is the default. Can be reversed with -u.

My guess is that the rsyncs are actually running in parallel, but the output makes it look as if they're running serially. The -u option changes that.

--

For example, with this command:

$ for i in 1 2 3 ; do echo a$i ; sleep 1 ; done
a1
a2
a3

By default, parallel prints each job's output only once that job has finished, so here we get no feedback until it's all done:

$ (echo a ; echo b ; echo c ) | parallel 'for i in 1 2 3 ; do echo {}$i ; sleep 1 ; done  ' 
a1
a2
a3
b1
b2
b3
c1
c2
c3

Whereas with -u, output gets printed right away:

$ (echo a ; echo b ; echo c ) | parallel -u 'for i in 1 2 3 ; do echo {}$i ; sleep 1 ; done  ' 
a1
b1
c1
a2
b2
c2
a3
b3
c3

In both cases it took about 3 s to run, though, so the jobs really are running simultaneously. Run one after another, the three jobs would have taken about 9 s.
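The timing can be checked rather than eyeballed. Below is a minimal sketch, assuming date +%s is available. The helper name elapsed is hypothetical, and the GNU parallel part is skipped if no parallel binary is on the PATH. Resolution is whole seconds, so expect the figures to be off by one now and then.

```shell
# Print the wall-clock seconds a command takes; its stdout is discarded.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null
    end=$(date +%s)
    echo $((end - start))
}

# One job on its own: three 1-second sleeps.
serial=$(elapsed sh -c 'for i in 1 2 3; do sleep 1; done')
echo "one job alone: ${serial}s"

# Three such jobs under GNU parallel, only if it is installed.
if command -v parallel >/dev/null 2>&1; then
    par=$(printf '%s\n' a b c |
        elapsed parallel -j 3 'for i in 1 2 3; do echo {}$i; sleep 1; done')
    echo "three jobs under parallel: ${par}s"
fi
```

If the second figure is close to the first instead of three times as large, the jobs overlapped. The same helper could be wrapped around the real rsync command, with and without -u, to confirm that grouping only changes when output appears and not how long the transfer takes.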

lemonsqueeze answered Oct 05 '22 05:10