How to speed up / parallelize downloads of git submodules using git clone --recursive?

Cloning git repositories that have a lot of submodules takes a really long time. The following example has ~100 submodules:

git clone --recursive https://github.com/Whonix/Whonix

Git clones them one by one, which takes much longer than necessary. Let's make the (probable) assumption that both the client and the server have sufficient resources to answer multiple (parallel) requests at the same time.

How to speed up / parallelize downloads of git submodules using git clone --recursive?

499

asked Sep 24 '14 17:09

adrelanos

2 Answers

With git 2.8 (Q1 2016), you will be able to initiate the fetch of submodules... in parallel!

See commit fbf7164 (16 Dec 2015) by Jonathan Nieder (artagnon).
See commit 62104ba, commit fe85ee6, commit c553c72, commit bfb6b53, commit b4e04fb, commit 1079c4b (16 Dec 2015) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit 187c0d3, 12 Jan 2016)

Add a framework to spawn a group of processes in parallel, and use it to run "git fetch --recurse-submodules" in parallel.

For that, git fetch has the new option:

-j, --jobs=<n>

Number of parallel children to be used for fetching submodules.
Each will fetch from different submodules, such that fetching many submodules will be faster.
By default submodules will be fetched one at a time.

Example:

git fetch --recurse-submodules -j2
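To call this from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of the original answer: the helper names `build_fetch_command` and `fetch_submodules` are hypothetical, and rejecting values below 1 is a choice made for this sketch rather than a rule taken from git.

```python
import subprocess


def build_fetch_command(jobs):
    """Return the argv for a submodule fetch using `jobs` parallel children."""
    # Sketch-level validation (an assumption, not git's own rule).
    if not isinstance(jobs, int) or jobs < 1:
        raise ValueError("jobs must be a positive integer")
    return ["git", "fetch", "--recurse-submodules", "--jobs={0}".format(jobs)]


def fetch_submodules(repo_dir, jobs=2):
    """Run the fetch inside repo_dir and return git's exit status."""
    return subprocess.call(build_fetch_command(jobs), cwd=repo_dir)
```

For example, `fetch_submodules("Whonix", jobs=8)` would run the fetch with eight children inside an existing clone.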

The bulk of this new feature is in commit c553c72 (16 Dec 2015) by Stefan Beller (stefanbeller).

run-command: add an asynchronous parallel child processor

This allows running external commands in parallel with ordered output on stderr.

If we run external commands in parallel, we cannot pipe the output directly to our stdout/err, as it would mix up. So each process's output will flow through a pipe, which we buffer. One subprocess can be piped directly to our stdout/err for low-latency feedback to the user.
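The buffering idea can be illustrated outside of git. The sketch below is not git's C implementation: it is a simplified Python analogue in which every child's output is captured in memory and returned in submission order, so nothing interleaves. Unlike the commit described above, it does not stream one child live. The function name `run_buffered` is hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_buffered(commands, jobs=2):
    """Run each argv list in `commands`, at most `jobs` at a time.

    Returns (returncode, output) tuples in the same order as `commands`,
    so output from different children never interleaves.
    """
    def run_one(argv):
        # Merge stderr into stdout and buffer everything in memory.
        done = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True)
        return done.returncode, done.stdout

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in submission order, not completion order.
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    demo = [[sys.executable, "-c", "print('child {0}')".format(i)]
            for i in range(4)]
    for code, output in run_buffered(demo, jobs=2):
        sys.stdout.write(output)
```

Returning results in submission order trades latency for readability: nothing is shown until the pool has finished, which is the cost git avoids by letting one child write straight through.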

Note that, before Git 2.24 (Q4 2019), "git fetch --jobs=<n>" allowed <n> parallel jobs when fetching submodules, but this did not apply to "git fetch --multiple" that fetches from multiple remote repositories.
It now does.

See commit d54dea7 (05 Oct 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit d96e31e, 15 Oct 2019)

fetch: let --jobs=<n> parallelize --multiple, too

Signed-off-by: Johannes Schindelin

So far, --jobs=<n> only parallelizes submodule fetches/clones, not --multiple fetches, which is unintuitive, given that the option's name does not say anything about submodules in particular.

Let's change that.
With this patch, fetches from multiple remotes are parallelized as well.

For backwards-compatibility (and to prepare for a use case where submodule and multiple-remote fetches may need different parallelization limits):

the config setting submodule.fetchJobs still only controls the submodule part of git fetch,

while the newly-introduced setting fetch.parallel controls both (but can be overridden for submodules with submodule.fetchJobs).
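The interplay of the two settings can be modeled as a small pure function. This is a sketch of the rules as described above, not git's actual code: `effective_jobs` is a hypothetical name, the `config` dict stands in for git's configuration, the default of 1 follows the "one at a time" default quoted earlier, and letting an explicit `--jobs=<n>` win over both settings is an assumption of this sketch.

```python
def effective_jobs(config, cli_jobs=None):
    """Return (submodule_jobs, multiple_remote_jobs) for a fetch.

    `config` is a plain dict standing in for git config values, e.g.
    {"fetch.parallel": 4, "submodule.fetchJobs": 8}.
    """
    if cli_jobs is not None:
        # Assumption: an explicit --jobs=<n> overrides both settings.
        return cli_jobs, cli_jobs
    default = 1  # one fetch at a time when nothing is configured
    parallel = config.get("fetch.parallel", default)
    # submodule.fetchJobs only affects the submodule part and, when set,
    # overrides fetch.parallel there.
    submodule = config.get("submodule.fetchJobs", parallel)
    return submodule, parallel
```

With `{"fetch.parallel": 4, "submodule.fetchJobs": 8}`, the function returns `(8, 4)`: eight jobs for submodules, four for multiple remotes.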

112

answered Sep 22 '22 02:09

VonC

When I run your command, it takes 338 seconds of wall-clock time to download the 68 MB.

With the following Python program, which relies on GNU parallel being installed,

#! /usr/bin/env python
# coding: utf-8

from __future__ import print_function

import os
import subprocess

jobs = 16

modules_file = '.gitmodules'

packages = []

if not os.path.exists('Whonix/' + modules_file):
    subprocess.call(['git', 'clone', 'https://github.com/Whonix/Whonix'])

os.chdir('Whonix')

# get list of packages from .gitmodules file
with open(modules_file) as ifp:
    for line in ifp:
        if not line.startswith('[submodule '):
            continue
        package = line.split(' "', 1)[1].split('"', 1)[0]
        #print(package)
        packages.append(package)


def doit():
    # universal_newlines=True lets communicate() exchange text on Python 2 and 3
    p = subprocess.Popen(['parallel', '-N1', '-j{0}'.format(jobs),
                          'git', 'submodule', 'update', '--init',
                          ':::'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True)
    res = p.communicate('\n'.join(packages))
    print(res[0])
    if res[1]:
        print("error", res[1])
    print('git exit value', p.returncode)
    return p.returncode


# sometimes one of the updates interferes with the others and generates lock
# errors, so we retry
for x in range(10):
    if doit() == 0:
        print('zero exit from git after {0} times'.format(x+1))
        break
else:
    print('could not get a zero exit from git after {0} times'.format(
          x+1))

that time is reduced to 45 seconds (on the same system; I did not do multiple runs to average out fluctuations).

To check if things were OK, I "compared" the checked out files with:

find Whonix -name ".git" -prune -o -type f -print0 | xargs -0 md5sum > /tmp/md5.sum

in the one directory and

md5sum -c /tmp/md5.sum

in the other directory and vice versa.
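The same comparison can be done in one step from Python. The following is a minimal sketch, assuming both checkouts are on the local disk: the names `tree_checksums` and `trees_match` are hypothetical, and, like the `find` command above, the walk skips anything named `.git` and only hashes regular files.

```python
import hashlib
import os


def tree_checksums(root):
    """Map each file path (relative to root) to its MD5 hex digest."""
    sums = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git directories, like `-name ".git" -prune` in find.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip .git files (used by submodules) and symlinks (`-type f`).
            if name == ".git" or os.path.islink(path):
                continue
            digest = hashlib.md5()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
            sums[os.path.relpath(path, root)] = digest.hexdigest()
    return sums


def trees_match(left, right):
    """True when both trees hold the same files with the same contents."""
    return tree_checksums(left) == tree_checksums(right)
```

Comparing the two dictionaries covers both directions at once: a file missing on either side, or a differing digest, makes `trees_match` return `False`.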

answered Sep 24 '22 02:09

Anthon
