
git submodule and fetch

I'm having trouble getting my head around submodules.

They appear to be unnecessarily complicated. Normally I totally avoid them, but a project has forced me into this situation.

So...

I have a git repo on our dev server with a submodule

/myproject
          /.git
          /files ...
          /other
               /submodule
                         /.git

Now because we run a dev/prod environment we are very restricted in what we can do

How do I

  • Clone the repo to prod such that the prod server is fully populated with the parent git and all the submodules checked out?
  • How do I then...
    • update a file in other/submodule
    • commit it
    • fetch it to the cloned repo
    • then merge it in the cloned repo

We traditionally use a fetch then merge strategy rather than a single pull. Due to the very small size of the team we also do not use a bare repo.

I have tried multiple different methods of achieving the above and none of them seems right. There seems to be a very large number of steps involved, so I must be doing something wrong.

Also, I do not want the fetch to the prod server to fetch from the submodule's remote repo.

Just so you know, the project I am working on is a Drupal 8 project, and it is entirely inappropriate to do dev on production; we do not even install composer or drush.

Asked by DeveloperChris, May 09 '18 13:05



1 Answer

[Submodules] appear to be unnecessarily complicated ...

Probably true. However, submodules are also necessarily complicated. :-) I will also note that submodule support is noticeably better in Git 2.x than it was in the bad old days of Git 1.5 or 1.6 or so, which is when I learned why people called them sob-modules. Some of that history is probably why some of the complexity is here.

Before I dive into the longer answer, here's the short way to get started: use git clone --recurse-submodules, or run git submodule update --init --recursive right after cloning. (The --recursive on git submodule update is only required if the submodule has submodules of its own.) Adding the --recurse-submodules option to git clone just tells git clone to do that git submodule update --init --recursive after its normal sequence of operations. Note that this won't help you with the process of working within the submodules, though.

Long

How do I ...

Git is a tool, not a solution (a common saying in the construction business, apparently, but generally applicable to most technology). As with most tools, there are multiple ways to use them.

The thing to know about a submodule is that each submodule is just another Git repository. The only thing that makes a Git repository a "submodule" is the fact that there is some "outer layer" repository that is controlling the inner repository in some way. From within the inner repository, we refer to the outer one as the superproject.

Within any Git repository in which you will do any work, you have a work-tree. The work-tree holds the files in their ordinary everyday form, where you (and the other programs on your computer) can work with them. Each Git repository also has an index, which is where you build up the next commit you will make. The index is also called the staging area and sometimes the cache, reflecting either its extremely important roles, or the poor choice of the word "index" for its original name (or perhaps both). And, of course, each Git repository has a collection of commits, with various branch names and/or tag names that identify specific commit hashes by some sort of human-readable name.

If that Git repository were standing on its own, those names—the branch and tag names—would be the useful ones to us humans, doing work in that repository. But we've just declared that this repository is a submodule that lives (or dies) at the command of some other repository—the superproject. Our own branch and tag names are nearly useless. They become useful if and when we treat this repository as a regular repository, not a mere adjunct to some superproject. When we treat this repository as a controlled entity, we want this repository to have a detached HEAD instead. The superproject, not the submodule therein, dictates the commit hash to check out, not by some sort of human-readable name, but by raw hash ID.

This feeds into all of the "how do I" answers. The superproject records, in the superproject's index, by its raw hash ID, the specific commit that should be checked out in the submodule.

Cloning

[How do I] Clone the repo ... such that [the clone] is fully populated with ... all the submodules checked out?

Like any clone, this can be made via git clone url [dir], which really consists of about six steps:

  1. Create a new, empty directory dir and switch (cd) to it, or use some existing empty directory if so told: ([ -d dir ] || mkdir dir) && cd dir. (If this fails, stop, and don't do any of the remaining steps. If a subsequent step fails, remove the new directory if we made it, and remove all the files we made, leaving no trace of the partial failed clone.) If we don't give git clone a directory name, it computes one from the url argument.
  2. Create a new, empty repository: git init. This creates the .git directory and an initial configuration.
  3. Do any required additional configuration from -c options given after git clone.
  4. Add a remote given a url: git remote add remote url. The usual name for the remote is origin but you can control this with the -o option.
  5. Obtain commits from the remote: git fetch remote.
  6. Check out some branch or tag name: git checkout name. If this is a branch name, the branch does not exist yet, so this creates the branch the same way that git checkout does. If this is a tag name, this checks out the commit as a detached HEAD. The name here is the one you gave with a -b option. If you did not give one, the name is obtained by asking the Git at the other end of the git fetch operation which branch it recommends, which is pretty commonly main. If that also fails—if the other Git has no name to recommend—the name used is main.

The last step, step 6, checks out some specific commit, typically by getting "on" a branch such as main, creating that branch name based off the names obtained during step 5 (git fetch which made origin/main). The act of checking out this particular commit fills in the repository's index and work-tree, so that now you have in your work-tree all the files required.
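As a rough sketch of those steps done by hand, the script below builds a throwaway source repository and then "clones" it without git clone. The names src and copy and the temporary paths are my own placeholders, and step 3 (extra -c configuration) is skipped, so treat it as an illustration rather than an exact replay of what git clone does internally:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A throwaway repository to clone from (hypothetical content).
git init -q -b main "$work/src"
echo hello > "$work/src/file.txt"
git -C "$work/src" add file.txt
git -C "$work/src" commit -q -m "initial"

# Steps 1-2: make the directory and create an empty repository in it.
mkdir "$work/copy"
git -C "$work/copy" init -q
# Step 4: add the remote under the usual name, origin.
git -C "$work/copy" remote add origin "$work/src"
# Step 5: obtain the commits; this creates origin/main.
git -C "$work/copy" fetch -q origin
# Step 6: check out main; the branch is created from origin/main.
git -C "$work/copy" checkout -q main

copied=$(cat "$work/copy/file.txt")
upstream=$(git -C "$work/copy" rev-parse --abbrev-ref 'main@{upstream}')
echo "file content: $copied, main tracks: $upstream"
```

The final two lines only check that the work-tree was filled in and that the new local main was set up from origin/main.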

Submodules and gitlinks

If the commit you just checked out has submodules, it has a file named .gitmodules and has, in that commit that you just checked out, one or more special entries each called a gitlink. A gitlink entry looks much like a file (blob) entry or a tree entry, but has type-code 160000 rather than 100644 (regular file) or 100755 (executable file) or 040000 (tree).[1] These gitlink entries go into your index, and your Git creates an empty directory at the path given by the gitlink, the same way your Git would create a subdirectory for a tree or a file for a blob.[2] The hash ID associated with these gitlink entries (every index entry has a hash ID) is that of one particular commit in the submodule, which Git can, but won't just yet, check out as a detached HEAD.

Note that I said here if the commit you just checked out has submodules. This is another key realization: the "submodule-ness" of a submodule is controlled by the specific commit in the superproject. That commit needs to have a gitlink entry, to give the hash ID to check out in the submodule, and a .gitmodules file. But what is this .gitmodules file for?
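To see a gitlink for yourself, something like the following sketch should work. It builds two scratch repositories (lib and super are hypothetical names, as is the other/submodule path borrowed from the question) and then reads the index entry. The protocol.file.allow=always setting is only there because this demo uses local paths as URLs, which newer Git versions refuse for submodules by default:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Repository that will play the role of the submodule.
git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

# Superproject that uses it at other/submodule.
git init -q -b main "$work/super"
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"

# The index entry for the submodule path: first field is the type-code.
git -C "$work/super" ls-files --stage other/submodule
mode=$(git -C "$work/super" ls-files --stage other/submodule | cut -d' ' -f1)

# The hash stored in the gitlink is a commit hash in the submodule.
recorded=$(git -C "$work/super" rev-parse HEAD:other/submodule)
actual=$(git -C "$work/lib" rev-parse HEAD)
echo "mode=$mode recorded=$recorded"
```

The recorded hash matches the commit in lib, which is the whole point: the superproject stores a raw commit ID, not a branch name.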


[1] There's one more index type-code, 120000, for symbolic links. These are handled almost exactly the same way as blob objects except that as long as symlinks are enabled, Git writes the contents as a symlink rather than as a file. If symlinks are disabled, Git writes the contents as a regular file, so that you can edit it and re-add it as a symlink later using git update-index, if you know all the magic for dealing with index entries.

[2] The fact that Git will create an empty directory for a tree object has led people to try to use Git's semi-secret empty tree to store empty directories. Unfortunately, the index itself has weird corner cases here and Git turns the empty tree into a gitlink entry under various conditions. This then acts as a broken submodule (a gitlink without a .gitmodules entry), which makes Git behave slightly badly.


The .gitmodules file

We just saw, above, that git clone needs at least one argument: the url for the repository to clone. The superproject stores the desired commit hash ID in the gitlink, but how will it know what url to use? The answer is to look in the .gitmodules file.

The contents of a .gitmodules are formatted the same way as .git/config or $HOME/.gitconfig or any other Git configuration file, and in fact, Git uses git config to read them:

git config -f .gitmodules --get submodule.path/to/x.url

This looks for

[submodule "path/to/x"]
    url = <whatever you put here>

in the .gitmodules file, and when we find it, that provides the URL.

In fact, the contents will be:

[submodule "path/to/x"]
    path = path/to/x
    url = <whatever you put here>

and perhaps also one or both of:

    branch = <name>
    update = <control>

The path must correspond to the relative path of the submodule within the superproject, and the name of the submodule is, by default, that same relative path. (What happens if one or the other of these is wrong or doesn't match, I am not quite sure. Git's submodule commands generally make sure they do match, so that the question never arises.)
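Here is a small sketch of reading those settings back with git config -f, in the same way as the command shown earlier. The repositories and the other/submodule path are scratch placeholders of mine, and the submodule name is assumed to equal its path (which is what git submodule add does when you give no other name):

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

git init -q -b main "$work/super"
# protocol.file.allow=always: only needed because the URL is a local path.
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"

# .gitmodules is an ordinary Git config file, so git config can read it.
cat "$work/super/.gitmodules"
path=$(git -C "$work/super" config -f .gitmodules \
    --get submodule.other/submodule.path)
url=$(git -C "$work/super" config -f .gitmodules \
    --get submodule.other/submodule.url)
echo "path=$path url=$url"
```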

This lets git submodule find the URL to make the clone. This process is complicated. When you run git submodule init or git submodule update --init, Git will copy the url setting from .gitmodules to .git/config. If there is an update = control setting, it will copy that too, unless there's already a setting in .git/config. (This is one of those "unnecessary complications" you mention, though I think it's to correct for historical mistakes.)

Without --init, the git submodule update command will only look at the entries in .git/config, not the ones in .gitmodules. This means you could use the two step sequence git submodule init && git submodule update to do the same thing, but git submodule update --init is easier to enter. More importantly, git submodule init does not have a --recursive option while git submodule update does. This is actually sensible, because git submodule init only copies from .gitmodules to .git/config (see below for more about this). The git submodule update operation actually creates the clone, using the six-step process outlined above.
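The two-step sequence can be watched in a scratch setup like the one below (lib, super and clone are hypothetical names). It is only a sketch: it checks that the url is absent from the clone's .git/config before git submodule init, present afterwards, and that the files only show up after git submodule update:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

git init -q -b main "$work/super"
# protocol.file.allow=always: only needed because the URLs are local paths.
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"

# A plain clone: the submodule directory exists but is empty.
git clone -q "$work/super" "$work/clone"
key=submodule.other/submodule.url
before=$(git -C "$work/clone" config --get "$key" || echo unset)

# Step 1: copy the url from .gitmodules into .git/config.
git -C "$work/clone" submodule --quiet init
after=$(git -C "$work/clone" config --get "$key")

# Step 2: actually clone the submodule and check out the recorded commit.
git -C "$work/clone" -c protocol.file.allow=always submodule --quiet update
content=$(cat "$work/clone/other/submodule/lib.txt")
echo "before=$before after=$after content=$content"
```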

Detaching HEAD onto the correct commit in the submodule

We saw that the superproject lists the correct hash ID for the submodule, as a gitlink entry. This means Git needs to start in the superproject, read the gitlink entry out of the index, then switch into the submodule (cd path) and git checkout the correct commit by its hash ID. That will result in a detached HEAD with the correct commit checked out.
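One way to confirm this on a scratch setup (again with made-up lib, super and clone repositories) is to compare the hash in the superproject's gitlink with the submodule's HEAD, and to ask whether that HEAD is symbolic. This is a sketch of a check, not anything Git requires you to run:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

git init -q -b main "$work/super"
# protocol.file.allow=always: only needed because the URLs are local paths.
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"

git clone -q "$work/super" "$work/clone"
git -C "$work/clone" -c protocol.file.allow=always \
    submodule --quiet update --init

# Hash recorded by the superproject vs. what the submodule has checked out.
recorded=$(git -C "$work/clone" rev-parse HEAD:other/submodule)
checked_out=$(git -C "$work/clone/other/submodule" rev-parse HEAD)

# symbolic-ref fails when HEAD is detached.
if git -C "$work/clone/other/submodule" symbolic-ref -q HEAD >/dev/null; then
    state=attached
else
    state=detached
fi
echo "recorded=$recorded checked_out=$checked_out state=$state"
```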

The command that does this is git submodule update. And, that's usually what we want: to check out that specific commit, by its hash ID, as a detached HEAD. Now that we've gotten what we want in the submodule, we're done ... or are we? What if this Git repository—remember, each submodule is an ordinary Git repository, in its own right—what if this Git repository has submodules of its own?

Submodules can have submodules

If this submodule has its own submodules, we now want this sub-Git to git checkout the correct commit, run git submodule init to initialize its .git/config for its submodules, and run git submodule update to make its own submodules get checked-out to the correct commit. That's just what git submodule update is already doing on behalf of our superproject, so we just want this git submodule update to recursively operate on the submodule's submodules. This means that git submodule update needs to be able to recurse into submodules and also --init them.

So that's why git submodule update --init --recursive exists: it's the workhorse that goes into each submodule from the superproject, sets up its .git/config if needed, checks out the correct detached-HEAD hash, and then recurses on submodules of the submodule.

git clone can invoke git submodule update

If we now rewind all the way back to git clone, we can see that what we need after step 6 is a step 7: git submodule update --init --recursive, to go into each submodule listed in the superproject and initialize it and check out the correct detached HEAD, and if that submodule is a superproject of additional submodules, handle them recursively. In the end, we'll have the superproject, with its particular commit, controlling all of its submodules which are on the correct commit as a detached HEAD, and for each of those submodules that is itself a superproject with submodules, the submodule-as-superproject's commit will control the submodule-as-superproject's submodules, recursively.

If you don't have recursive submodules, all of the recursion winds up doing nothing: it's a little bit of extra work but is harmless. So this is usually the way to go: just run git clone --recurse-submodules and you get the clone created with its submodules checked out as detached HEAD repositories, and you are done.
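To make the recursion concrete, here is a sketch with three scratch repositories: inner is a submodule of lib, and lib is a submodule of super. All names and paths (inner, lib, super, prod, deps/inner) are placeholders I picked. A single git clone --recurse-submodules should populate both levels:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Local-path URLs are refused for submodules by default in newer Git,
# so every submodule-cloning command below passes this setting.
allow="protocol.file.allow=always"

git init -q -b main "$work/inner"
echo inner > "$work/inner/inner.txt"
git -C "$work/inner" add inner.txt
git -C "$work/inner" commit -q -m "inner"

# lib is itself a superproject of inner.
git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" -c "$allow" submodule --quiet add "$work/inner" deps/inner
git -C "$work/lib" commit -q -m "lib with nested submodule"

git init -q -b main "$work/super"
git -C "$work/super" -c "$allow" \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"

# One command: clone, then update --init --recursive as the extra step.
git -c "$allow" clone -q --recurse-submodules "$work/super" "$work/prod"

outer=$(cat "$work/prod/other/submodule/lib.txt")
nested=$(cat "$work/prod/other/submodule/deps/inner/inner.txt")
echo "outer=$outer nested=$nested"
```

On a real server you would drop the protocol.file.allow setting and use your normal URLs.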

Working within the submodules

You had what is almost a separate question:

How do I then update a file in other/submodule?

We saw above that the way a superproject controls / uses a submodule is by having the superproject specify, by absolute hash ID, which commit the submodule is to be locked into, as a detached HEAD. That's great for controlling and using the submodule, except when we want to update the submodule to some newer commit.

The traditional answer, dating back to the Git 1.5 days, is that since the submodule is a Git repository, just cd into the submodule and git checkout <branchname> and start working. This still works! It has an obvious drawback, though: how do you know which branch name to use?

In some cases, you just know. That's fine; go ahead and use them that way. If you want the superproject to know, though, this is where the superproject's branch = setting comes in, and where arguments to git submodule update and/or the submodule.name.update settings (also in the superproject) come in. Remember, these settings come from the .git/config file in the superproject, not from the submodule itself, and (normally[3]) not from the .gitmodules file either, but the .gitmodules file contents set up the default .git/config settings. So there are a lot of ways to control this configuration.

Next, there's the question of what each configuration does, and how you want to set it up for your own purposes. These are enumerated (rather poorly in my opinion) in the git submodule documentation. Here's my own summary of their summary, with additional commentary.

  • checkout: the commit recorded in the superproject will be checked out in the submodule on a detached HEAD.

    This is the default and is what we saw above.

  • rebase: the current branch of the submodule will be rebased onto the commit recorded in the superproject.

    This isn't useful unless you've already gone into the submodule and done something there. However, there's also a --remote option described later in the documentation, which makes it more useful.

  • merge: the commit recorded in the superproject will be merged into the current branch in the submodule.

    As with rebase, this isn't useful by itself: you need either --remote or to do your own work in the submodule before doing this.

  • custom command: arbitrary shell command that takes a single argument (the sha1 of the commit recorded in the superproject) is executed.

    This one is useful by itself, but requires that you do some up-front work in the superproject, to set up the configuration and define the command.

  • none: the submodule is not updated.

    This is primarily useful to mark a submodule that doesn't get updated when all the other submodules of this particular superproject do. If you have only one submodule, this setting has no function at all.

So far, we have not seen any use for the branch setting copied from .gitmodules to .git/config. It's this --remote option, described further down in the same documentation, that talks about how this setting is used:

... Instead of using the superproject's recorded SHA-1 to update the submodule, use the status of the submodule's remote-tracking branch.

That is, the superproject has a gitlink entry that says use hash a1b2c3d... or whatever, but instead of using that hash, when the superproject git submodule update command goes poking around with the Git repository holding the submodule, the superproject command will look up, e.g., origin/main in the submodule. The name main here comes from that branch setting, so setting submodule.name.branch to, say, develop instead will make the superproject use origin/develop instead of origin/main.[4]

To make this useful, the superproject Git runs git fetch in the submodule before starting any of this. That causes the submodule to bring over any new commits from its origin Git, updating its origin/main, origin/develop, and so on. The assumption here is that you did not do any work in the submodule yourself! You are just grabbing work that someone else did in the origin repository from which the submodule repository was cloned (whew!).
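A sketch of --remote in action, under my own assumptions: lib gains a new commit on a branch called develop, and a clone named prod sets submodule.other/submodule.branch in its .git/config before running the update. Afterwards the submodule sits on the new commit while the superproject's recorded gitlink is still the old one:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Only needed because this demo uses local paths as submodule URLs.
allow="protocol.file.allow=always"

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

git init -q -b main "$work/super"
git -C "$work/super" -c "$allow" \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"
git -c "$allow" clone -q --recurse-submodules "$work/super" "$work/prod"

# Someone else does work on develop in the submodule's origin.
git -C "$work/lib" checkout -q -b develop
echo v2 > "$work/lib/lib.txt"
git -C "$work/lib" commit -q -am "lib v2"
new=$(git -C "$work/lib" rev-parse develop)

# Tell the superproject which branch --remote should follow, then update.
git -C "$work/prod" config submodule.other/submodule.branch develop
git -C "$work/prod" -c "$allow" submodule --quiet update --remote

now=$(git -C "$work/prod/other/submodule" rev-parse HEAD)
recorded=$(git -C "$work/prod" rev-parse HEAD:other/submodule)
echo "submodule now at $now, superproject still records $recorded"
```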


[3] The setting in .gitmodules will be used if there is no setting in .git/config and no override on the command line. I think this is yet another backwards-compatibility item.

[4] This assumes that origin/develop is the remote-tracking name associated with branch develop in the submodule repository, i.e., that things are set up as normal.


Preparing the updated submodule

If you are about to do your own work in your own submodule, none of this helps you at all. Instead, you should just cd into the submodule repository and run git checkout branchname. That will take you off your detached HEAD and put you on the given branch, and now you can work normally. Write code, git add, and git commit as you normally would. When everything is ready in the submodule, cd back to the superproject. You will have your submodule on a branch (not in detached HEAD mode), on some particular commit.

If you are just picking up someone else's work, git submodule update --remote --checkout (or whichever update mode you chose) will git fetch and then git checkout origin/main, or whatever name is appropriate, in the submodule. That will leave your submodule on no branch, in detached HEAD mode, on some particular commit. This is likely what you want.

Using the updated submodule within the superproject

Either way, from the superproject's point of view, what has happened is that the submodule is now on a different commit. The superproject does not care whether the submodule's HEAD is attached or detached; what matters is the current commit in the submodule.

Now that the submodule is on the desired commit, make any other changes you want in the superproject—maybe there is some file that should use some new feature of the submodule, for instance. When you are done making the required changes, git add any updated files, and also run git add on the submodule path (without a trailing slash):

git add features.ext   # updated to use feature F of submodule sub/S
git add sub/S          # record the new gitlink for sub/S!

This updates the superproject's index, so that now we have not only the updated file (features.ext) but also the new correct hash ID for the submodule—the updated gitlink. Now we can run git commit in the superproject as usual:

git commit

and this makes our new commit, which has a gitlink that records the fact that submodule sub/S should be checked out with a detached HEAD at commit f37c219... or whatever the current commit of sub/S actually is. This new commit goes on whatever branch we have checked out in the superproject, whether that's main or develop or whatever.
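Putting the last two sections together, the whole "work in the submodule, then record it in the superproject" sequence might look like the sketch below. The repositories (lib, super, dev) are scratch placeholders, and I use the question's other/submodule path instead of sub/S:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Only needed because this demo uses local paths as submodule URLs.
allow="protocol.file.allow=always"

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

git init -q -b main "$work/super"
git -C "$work/super" -c "$allow" \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"
git -c "$allow" clone -q --recurse-submodules "$work/super" "$work/dev"

sub="$work/dev/other/submodule"
old=$(git -C "$work/dev" rev-parse HEAD:other/submodule)

# In the submodule: leave detached HEAD, get on a branch, commit as usual.
git -C "$sub" checkout -q main
echo v2 > "$sub/lib.txt"
git -C "$sub" commit -q -am "lib v2"

# In the superproject: the submodule path now shows as modified;
# git add on that path (no trailing slash) stages the new gitlink.
git -C "$work/dev" status --short
git -C "$work/dev" add other/submodule
git -C "$work/dev" commit -q -m "use lib v2"

recorded=$(git -C "$work/dev" rev-parse HEAD:other/submodule)
subhead=$(git -C "$sub" rev-parse HEAD)
echo "gitlink moved from $old to $recorded"
```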

Pushing

Let's say we did our own work in sub/S, on its branch develop, creating commit f37c219.... Then we made our new commit in our superproject on the superproject's main; by some strange chance its hash ID is abcdef1.... Now that we have two repositories with updates, we can git push them. But there is an order constraint!

Suppose we push our superproject now:

git push origin main

Our new commit abcdef1 goes to our upstream repository, and that Git's main now names our new commit abcdef1. Our new commit says that submodule sub/S should be checked out at commit f37c219. So Fred, over on Fred's computer, runs git clone or git fetch or whatever it is and gets our commit abcdef1 that says "use commit f37c219... when using sub/S". Fred runs git submodule update and his Git goes into his sub/S and tries to check out f37c219 and, whoops, Fred doesn't have f37c219. In fact, only we have f37c219, because we just made it!

We'd best very quickly cd sub/S and run git push origin develop. (Remember, we made our f37c219 on our develop in our submodule.) That way, when Fred tries to access f37c219, it's at least available somewhere. It's better if we git push that one first, then git push origin main in the superproject, to push abcdef1 which refers to f37c219. So this leads to a rule for pushing: push the submodules first, deepest submodule first. That way each superproject refers to a commit that Fred, or whoever, can get to.
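Here is a sketch of that ordering with scratch bare repositories (lib.git and super.git, names of my choosing) standing in for the upstream server. It also uses git push --recurse-submodules=check, which is a real git push option but not something discussed above, to show Git refusing the superproject push while the submodule commit is still unpushed:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Bare upstreams for the submodule and the superproject.
git init -q --bare -b main "$work/lib.git"
git init -q --bare -b main "$work/super.git"

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"
git -C "$work/lib" push -q "$work/lib.git" main

git init -q -b main "$work/super"
# protocol.file.allow=always: only needed because the URL is a local path.
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/lib.git" sub/S
git -C "$work/super" commit -q -m "add sub/S"
git -C "$work/super" remote add origin "$work/super.git"
git -C "$work/super" push -q origin main

# Our own work: a new commit on develop in sub/S, recorded in the superproject.
git -C "$work/super/sub/S" checkout -q -b develop
echo v2 > "$work/super/sub/S/lib.txt"
git -C "$work/super/sub/S" commit -q -am "lib v2"
git -C "$work/super" add sub/S
git -C "$work/super" commit -q -m "use new sub/S commit"

# Wrong order: with =check, Git aborts because sub/S has an unpushed commit.
if git -C "$work/super" push -q --recurse-submodules=check origin main \
        2>/dev/null; then
    early=pushed
else
    early=refused
fi

# Right order: submodule first, then the superproject.
git -C "$work/super/sub/S" push -q origin develop
git -C "$work/super" push -q --recurse-submodules=check origin main

sub_commit=$(git -C "$work/super/sub/S" rev-parse HEAD)
on_server=$(git -C "$work/lib.git" cat-file -t "$sub_commit")
server_main=$(git -C "$work/super.git" rev-parse main)
local_main=$(git -C "$work/super" rev-parse HEAD)
echo "early push: $early; submodule object on server: $on_server"
```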

There is still one more minor pain point for Fred

We introduced Fred above as the first guy to fetch (and merge or rebase or otherwise incorporate, perhaps even git pull) our superproject commit that refers to new subproject commits. However, Fred here stands in for anyone who has cloned our superproject. They all have our superproject, and they all ran git submodule update --init --recursive, perhaps as part of the very clone command that got them the superproject, so they have all the submodules already.

But they don't have any of the new commits in the submodules yet. When they update their superproject and tell their Git to git submodule update, their Gits will go into their submodules and not find the right commit hashes. Fortunately, git submodule update is smart enough to run git fetch for you (or for Fred).

For this to work, though, whoever is updating has to be online. This means you must run git submodule update when connected. If you're always connected, that's no problem, but if not, there should be an easy way to fetch all submodules up front.

There's no git submodule fetch, but there is a command that will do the trick:

git submodule foreach --recursive git fetch

This will run git fetch in each submodule, to update it. That way a later git submodule update, used with any commit in the superproject, will work even if you are offline and the submodules would have required updating.
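As a sketch of that fetch-up-front workflow on scratch repositories (lib, super and prod are placeholder names): prod fetches everything while "connected", then later merges and updates the submodule with --no-fetch, which tells git submodule update not to contact the submodule's remote at all:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Only needed because this demo uses local paths as submodule URLs.
allow="protocol.file.allow=always"

git init -q -b main "$work/lib"
echo v1 > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m "lib v1"

git init -q -b main "$work/super"
git -C "$work/super" -c "$allow" \
    submodule --quiet add "$work/lib" other/submodule
git -C "$work/super" commit -q -m "add submodule"
git -c "$allow" clone -q --recurse-submodules "$work/super" "$work/prod"

# Upstream moves on: lib gets v2 and super records it.
echo v2 > "$work/lib/lib.txt"
git -C "$work/lib" commit -q -am "lib v2"
git -C "$work/super" -c "$allow" submodule --quiet update --remote
git -C "$work/super" add other/submodule
git -C "$work/super" commit -q -m "use lib v2"

# While connected: fetch the superproject and every submodule up front.
git -C "$work/prod" fetch -q origin
git -C "$work/prod" -c "$allow" \
    submodule --quiet foreach --recursive 'git fetch -q'

# Later, possibly offline: merge, then update without any fetching.
git -C "$work/prod" merge -q --ff-only origin/main
git -C "$work/prod" submodule --quiet update --no-fetch

content=$(cat "$work/prod/other/submodule/lib.txt")
want=$(git -C "$work/lib" rev-parse HEAD)
have=$(git -C "$work/prod/other/submodule" rev-parse HEAD)
echo "prod submodule content: $content"
```

Since the question mentions a fetch-then-merge habit, the superproject half of this (git fetch, then git merge) fits that habit unchanged; only the submodule step is extra.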

Answered by torek, Oct 23 '22 08:10