Is there a way I can only pull the latest commit in a git submodule? I was trying to put boost as a git submodule in some projects but since the boost repo with everything included is really heavyweight I wanted to only update the submodules to the latest commit and not pull all commits. Is this possible?
For example, when I do
git submodule update --init --recursive
All the boost submodules get pulled with all their commits. Can I only ask the submodules to mirror the latest commit instead of pulling all changes?
Note: Shallow clones with the --depth flag do not work because that only pulls the latest commit, and the latest commit has only the changes made in that commit, so the repository is not in the right state.
Note: git archive (as suggested in an answer below) does not seem to work when I try the following sequence of commands:
mkdir temp-git-test
cd temp-git-test
git init
git submodule add --depth 1 https://github.com/boostorg/boost
cd boost
git archive --format=tar HEAD --output ../boost.tar.gz
cd ..
tar -xzvf boost.tar.gz
The output of the unzipped repo is the same as the submodule. Am I doing something wrong?
Pulling with submodules. Once you have set up the submodules you can update the repository with fetch/pull like you would normally do. To pull everything including the submodules, pass --recurse-submodules to git pull. To move each submodule to the latest commit of its tracked branch, use the --remote option of git submodule update.
In order to update an existing Git submodule, you need to execute "git submodule update" with the "--remote" and "--merge" options. Using the "--remote" option, you will be able to update your existing Git submodules without having to run "git pull" commands in each submodule of your project.
Git submodules allow you to keep a git repository as a subdirectory of another git repository. Git submodules are simply a reference to another repository at a particular snapshot in time. Git submodules enable a Git repository to incorporate and track version history of external code.
The short answer is no. The long answer is maybe, but consider another way.
The long answer, which lets you get partway to what you want, starts with a technical note: you're not pulling, in Git terms. In Git, "pull" means "fetch, then merge-or-rebase" and you are not going to merge-or-rebase here. In fact, when you're "init"-ing you are generally going to make the initial clones.
Each submodule is actually its own repository.1 Git is, sooner or later, going to do a git checkout within each of those repositories, asking it to check out, not a branch, but rather one specific commit, which is quite often not the latest commit. Given the nature of Git repositories and software development, and the idea that a submodule is, in the first place, a reference to a third-party repository, i.e., one you specifically do not and cannot control, the best you can do is say: "I know that my software works with one specific version of their software, and that version is <fill in the blank>." Thus, your repository lists the specific version you want from their repository.
Now we get to the heart of the problem. When you git clone a repository, or use git fetch to update an existing clone, you do so by asking for specific branch and/or tag names, rather than specific commit IDs. There is some (very limited) support for fetching specific IDs, but it must be enabled in that other repository, the one we just said that you do not and cannot control. Enabling fetch-by-ID is computationally expensive for them (whoever "they" are, the ones controlling the other repository) and not something you can do on your side, nor demand, nor is it enabled by default. This means that in general it's just not available.
In any case, git clone only works with names: you may git clone -b branch url, for instance, to make your new clone start by checking out that specific branch, or git clone -b tag url to make your new clone start by checking out (as a detached HEAD) that specific tag. Despite this "check out a specific branch or tag", though, the clone defaults to cloning all the names offered by the remote, and making a full-depth (i.e., non-shallow) clone.
All of this does mean something important. First, shallow clones exist. A shallow clone is one made with a --depth argument. It can be deepened by a git fetch with another --depth. The "depth" is the number of commits fetched "beyond" the commit(s) identified by the name(s) used during the clone or fetch, with some fairly complicated rules. (The details of these rules don't matter much here.)
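As a minimal sketch of the shallow-then-deepen behaviour described above, the following script builds a throwaway repository and clones it with --depth. The directory names (upstream, shallow) and the demo identity are assumptions made up for illustration. A file:// URL is used because Git ignores --depth when cloning from a plain local path.

```shell
set -eu
work=$(mktemp -d)          # scratch directory for the demo
cd "$work"

# Build a throwaway "upstream" repository with three commits.
git init -q upstream
for i in 1 2 3; do
    echo "$i" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $i"
done

# Shallow clone: only the tip commit is transferred,
# but the working tree still holds the complete snapshot.
git clone -q --depth 1 "file://$work/upstream" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)

# Deepen the existing shallow clone with another --depth.
git -C shallow fetch -q --depth 2
deeper_count=$(git -C shallow rev-list --count HEAD)

echo "after clone: $shallow_count commit(s), after deepening: $deeper_count commit(s)"
```

The depth-1 clone has the full content of file.txt even though only one commit was transferred, since a commit names a complete snapshot rather than a set of changes.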
Second, because shallow clones exist, shallow submodules also exist. A shallow submodule is simply a submodule that is cloned with --depth. But there is a problem: there is no easy or obvious way to determine what depth is needed. You can pass a --depth argument to git submodule add or git submodule update, but it's not obvious how deep you should go.
Here's the problem: your submodule will be cloned, perhaps by a branch or tag name, but then your submodule will be told to check out one particular commit (by its raw hash ID). Will that commit be in the clone? What depth guarantees that it will? If you are cloning by tag name, and the tag always names the correct commit, you can use --depth 1 (and hence you can use --shallow-submodules during the initial git clone as well), but that only works if, well, see above.
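To make the --shallow-submodules case concrete, here is a small sketch using two throwaway local repositories. The names lib, super and checkout are hypothetical, and the -c protocol.file.allow=always setting is only there because recent Git versions refuse file-based submodule clones by default; it would not be needed for an https URL. It works here only because the superproject records the tip commit of lib, which is exactly the condition discussed above.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Helper: commit in repo $1 with message $2, using a demo identity.
commit() {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"
}

# A small "library" repository with two commits.
git init -q lib
for i in 1 2; do
    echo "$i" > lib/version.txt
    git -C lib add version.txt
    commit lib "lib commit $i"
done

# A superproject that records the current tip of lib as a submodule.
git init -q super
git -C super -c protocol.file.allow=always \
    submodule --quiet add "file://$work/lib" lib
commit super "add lib submodule"

# Clone the superproject; each submodule is cloned with depth 1.
git -c protocol.file.allow=always clone -q \
    --recurse-submodules --shallow-submodules "file://$work/super" checkout

sub_count=$(git -C checkout/lib rev-list --count HEAD)
echo "submodule history depth: $sub_count"
```

If the superproject recorded an older commit of lib instead of the tip, the depth-1 clone would not contain it, and Git would have to fall back to fetching that commit by ID, which the server may refuse.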
1What's special about these sub-repositories is that they are:
.gitmodules file);
The modules file lists the names and URLs for the various submodules. "Initializing" a submodule amounts to copying stuff from .gitmodules to the configuration file for the containing superproject, and "updating" a submodule usually amounts to cloning or fetching. The commit at which the submodule is to be detached is recorded in the superproject's repository as a "gitlink" entry in a tree object.
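The pieces named in this footnote can be inspected directly. The sketch below (repository names lib, super and checkout are made up for the demo) shows the URL in .gitmodules, the copy that git submodule init writes into the superproject's configuration, and the gitlink entry in the tree.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

commit() {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"
}

# Library plus a superproject that uses it as a submodule.
git init -q lib
echo 1 > lib/version.txt
git -C lib add version.txt
commit lib "lib commit 1"

git init -q super
git -C super -c protocol.file.allow=always \
    submodule --quiet add "file://$work/lib" lib
commit super "add lib submodule"

# Clone without --recurse-submodules: lib/ is an empty directory for now.
git clone -q "file://$work/super" checkout

# The URL as listed in the .gitmodules file.
modules_url=$(git -C checkout config -f .gitmodules --get submodule.lib.url)

# "Initializing" copies that URL into the superproject's own config.
git -C checkout submodule --quiet init
config_url=$(git -C checkout config --get submodule.lib.url)

# The commit to check out is stored as a gitlink (mode 160000) tree entry.
gitlink=$(git -C checkout ls-tree HEAD lib)
recorded=$(git -C lib rev-parse HEAD)

echo ".gitmodules URL: $modules_url"
echo "config URL:      $config_url"
echo "tree entry:      $gitlink"
```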
Submodule support has grown rather complex in modern versions of Git though, so now there are more things you can do when doing the update step.
There is a much better, more general solution for many cases. Instead of fussing with shallow clones, you can point Git at a reference clone. The reference clone is any clone of the repository you're trying to clone.2 Ideally, it's a recent and reasonably up-to-date clone of the repository you are cloning, but any clone will do.
What Git does with a reference clone is a bit complicated (see the documentation for details), but the short version is that when cloning some repository, instead of getting all the objects over the network from some distant server (which may be slow and/or rate-limited), your Git will ask the distant server what objects and such it needs, then look at your local3 reference clone to see if it already has those objects. If so, it will "borrow" them from the reference clone.
This lets you obtain a full, complete, up-to-date clone while using very little network and storage resources, since you will no longer need to bring (most or all of) the data over, nor (unless you use --dissociate) store it yourself. That in turn means you need not worry about your shallow clone being too shallow: you just get one slow full clone, then reference the heck out of it for all other clones, which go fast. Using reference clones can cut the time to clone a few big GitHub repositories, from an hour-plus, down to tens of seconds, for instance.
2Technically, the reference could be any repository at all. A repository not actually related to the one you are cloning is going to make a lousy reference, though: it will have none of the objects you need, and will provide no speedup at all. (It could even have the wrong data under the object's name, although the chances of this are vanishingly small. This cannot happen if the reference is correct since object names cannot be reused this way.)
3The reference should be "as local as possible" for speed, but does not really have to be on your machine, just accessible. If the reference will not always be present you will probably want to add --dissociate, so that the objects get copied from the reference clone into the new clone. This uses more disk space, of course.
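A reference clone can be tried out locally. The sketch below uses made-up directory names (upstream, reference, borrowed, standalone): it makes one full clone to serve as the reference, then makes two more clones, one that keeps borrowing objects and one that uses --dissociate to copy them.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Throwaway "distant" repository with three commits.
git init -q upstream
for i in 1 2 3; do
    echo "$i" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $i"
done

# The one slow, full clone that later clones will reference.
git clone -q "file://$work/upstream" reference

# Borrow objects from the reference: this clone depends on it afterwards.
git clone -q --reference "$work/reference" "file://$work/upstream" borrowed

# Borrow only during the clone, then copy the objects and cut the link.
git clone -q --reference "$work/reference" --dissociate \
    "file://$work/upstream" standalone

borrowed_count=$(git -C borrowed rev-list --count HEAD)
standalone_count=$(git -C standalone rev-list --count HEAD)

# The borrowing clone records where its objects live in an alternates file.
echo "borrowed clone borrows from: $(cat borrowed/.git/objects/info/alternates)"
echo "commits: borrowed=$borrowed_count standalone=$standalone_count"
```

Both clones have the full history. For submodules, git submodule update also accepts a --reference option, so the same idea applies to each submodule's clone.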
Note Shallow clones with the --depth flag do not work because that only pulls the latest commit, and the latest commit has only the changes made in that commit, so the repository is not in the right state.
Then combine a git archive of the boost repo with a shallow clone setting for your submodule:
git archive image of the same repo, making the working tree an exact replica of the remote repo SHA1.
From there, each (shallow) refresh will complement content that was already complete, and will remain up-to-date.
git archive is done in a local clone of the repo:
git archive --format=tar HEAD
If you don't have a local clone, but the boost repo is on GitHub (like, for instance, boostorg/boost), then you can get a compressed image of the current HEAD with a simple curl (no need for git archive then).
As seen in the comment, adding the content of an archive is of no use, as it represents the same content as the commit.
However, this seems incomplete:
git submodule add --depth 1 https://github.com/boostorg/boost
For a git submodule update --remote to work (i.e., to fetch the last commit instead of keeping the initial SHA1 checkout), you would need:
git submodule add -b master --depth 1 https://github.com/boostorg/boost
Then a git submodule update --init --recursive --remote would fetch the last commit.
See "Git submodules: Specify a branch/tag".
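The -b plus --remote combination can be checked against a throwaway local repository instead of boost. In this sketch the names lib and super are hypothetical, the branch name is read from the library repository rather than assumed to be master, and -c protocol.file.allow=always is needed only because the submodule URL is a local file:// path.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

commit() {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"
}

# A stand-in for the upstream library.
git init -q lib
echo 1 > lib/version.txt
git -C lib add version.txt
commit lib "lib commit 1"
branch=$(git -C lib symbolic-ref --short HEAD)   # master or main, depending on config

# Add it as a shallow submodule that tracks a branch.
git init -q super
git -C super -c protocol.file.allow=always \
    submodule --quiet add -b "$branch" --depth 1 "file://$work/lib" lib
commit super "add lib, tracking $branch"

# Upstream moves on after the submodule was added.
echo 2 > lib/version.txt
git -C lib add version.txt
commit lib "lib commit 2"

# --remote checks out the tip of the tracked branch
# instead of the commit recorded in the superproject.
git -C super -c protocol.file.allow=always \
    submodule --quiet update --init --recursive --remote

sub_head=$(git -C super/lib rev-parse HEAD)
upstream_head=$(git -C lib rev-parse HEAD)
echo "submodule at $sub_head, upstream tip $upstream_head"
```

After the update the superproject shows lib as modified, since the recorded gitlink still names the older commit until you commit the change.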