git pull multiple remotes in parallel

I have a repo with thousands of remotes, and I'd like to pull from them in parallel. Ideally, I could specify a maximum number to run at the same time.

I wasn't able to find anything related to this in the man pages, on Google, or in the online git-scm documentation.

To be perfectly clear: I do not want to run one command over multiple repos, I have one repo with thousands of remotes.

This has nothing to do with submodules, don't talk about submodules. Submodules are unrelated to git remotes.

Incognito asked Mar 25 '17 18:03



2 Answers

Starting with Git 2.24, this is possible with the --jobs (-j) option of git fetch.

Some examples:

Fetching 3 remotes, with at most 2 fetched in parallel:

git fetch -j2 --multiple remote1 remote2 remote3

Fetching all remotes, with at most 5 fetched in parallel:

git fetch --jobs=5 --all
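
If the same limit should apply to every fetch, Git's configuration documentation also describes a fetch.parallel setting, so the job count does not have to be repeated on each command line. A minimal sketch, assuming Git 2.24 or later; the throwaway repository and the value 5 are only illustrative:

```shell
#!/bin/sh
set -eu

# Throwaway repository so the example does not touch a real one.
repo=$(mktemp -d)
git init -q "$repo"

# Store a default of 5 parallel fetch jobs in this repository's config.
git -C "$repo" config fetch.parallel 5

# Show the stored value.
git -C "$repo" config --get fetch.parallel
```

A -j or --jobs value given on the command line still takes precedence over the configured default.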

If you have thousands of remotes, don't want to fetch all of them, and they form logical groups, you can define remote groups in .git/config instead of listing every remote on the command line:

[remotes]
    group1 = remote1 remote2 origin
    group2 = remote55 remote66

Then use the group names in the fetch command. For example, this command fetches remote1, remote2, origin, remote55, remote66, and remote10, with up to 4 fetches running in parallel:

git fetch --multiple -j4 group1 group2 remote10
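
To try the group mechanism without touching a real repository, the following sketch builds a few throwaway repositories and fetches them through a group. It assumes Git 2.24 or later; the directory layout, the hub repository, and the demo identity are hypothetical names made up for illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Create three small source repositories to act as remotes.
for name in remote1 remote2 remote3; do
    git init -q "$work/$name"
    git -C "$work/$name" symbolic-ref HEAD refs/heads/main
    git -C "$work/$name" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "initial commit in $name"
done

# One repository ("hub", a hypothetical name) with all three as remotes.
git init -q "$work/hub"
for name in remote1 remote2 remote3; do
    git -C "$work/hub" remote add "$name" "$work/$name"
done

# Same effect as adding "group1 = remote1 remote2" under [remotes] by hand.
git -C "$work/hub" config remotes.group1 "remote1 remote2"

# Fetch the group plus one extra remote, at most two fetches at a time.
git -C "$work/hub" fetch -q --multiple -j2 group1 remote3

# List the remote-tracking branches that now exist.
git -C "$work/hub" for-each-ref --format='%(refname)' 'refs/remotes/*/main'
```

Using git config to write the group keeps the sketch scriptable; editing .git/config directly gives the same result.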

Mariusz Pawelski answered Oct 11 '22 00:10



I'm pretty sure you have to write your own code to do this.

As CodeWizard says in a comment, Git needs to lock parts of the repository. Some of these locks are bound to collide at times if you simply run multiple git fetch processes in parallel within a single repository.

You might also want some kind of remote-ordering strategy. For example, fetching from remoteA, remoteB, and remoteC in parallel may transfer 10000 objects from remoteB that the other two also supply, if remoteB is generally (but not always) a superset of remoteA and remoteC. [1] Ordering also matters for sequential git fetch operations, but considerably less.

Suppose, for example, that there are 5000 objects (some commits, some trees, and some blobs) on A that you do not yet have, 5000 others on C, and all 10000 on B. If you fetch sequentially, in any order, you pick up either 5k, then 5k, then 0; or 10k, then 0, then 0; because by the time you move to the next remote, you have collected and stored the 5k or 10k incoming objects. But if you do all three in parallel, you will bring in 5k, 5k, and 10k objects, and only then discover that you have doubled your workload.


[1] If B is always a superset, simply go to B first (sequentially), then go to A and C in parallel solely for their references, which will point to objects you now have.
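
The ordering described in this footnote can be sketched as two fetch steps. This is only an illustration under stated assumptions: Git 2.24 or later for -j with --multiple, plus throwaway repositories, a hub repository, and branch names (from-a, from-c) that are all hypothetical. Here remoteB is built to contain everything in remoteA and remoteC.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Two independent source repositories.
for name in remoteA remoteC; do
    git init -q "$work/$name"
    git -C "$work/$name" symbolic-ref HEAD refs/heads/main
    git -C "$work/$name" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "work from $name"
done

# remoteB is a superset: it holds a copy of everything in A and C.
git init -q --bare "$work/remoteB"
git -C "$work/remoteB" fetch -q "$work/remoteA" main:refs/heads/from-a
git -C "$work/remoteB" fetch -q "$work/remoteC" main:refs/heads/from-c

# The repository that has all three as remotes.
git init -q "$work/hub"
for name in remoteA remoteB remoteC; do
    git -C "$work/hub" remote add "$name" "$work/$name"
done

# Step 1: fetch the superset on its own, so its objects are stored first.
git -C "$work/hub" fetch -q remoteB

# Step 2: fetch the others in parallel; their refs point at objects
# that are already present locally.
git -C "$work/hub" fetch -q --multiple -j2 remoteA remoteC
```

After both steps, the tracking branches for remoteA and remoteC resolve to the same commits that arrived through remoteB.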

torek answered Oct 10 '22 23:10
