Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Specify branch for a git submodule?

Tags:

git

I have added a git submodule to my git repository and it works fine.

In my "parent" repository I have created a feature branch: myfeature that requires some changes to the sub module. But I don't want to affect other teams using the same sub module. Therefore I have created a corresponding feature branch on the submodule repository submodule-feature with some changes. I have then added/committet the change from the submodule directory followed by the same in the root of the parent repository.

But when I switch back to master on my "parent" repository the submodule is still on the submodule-feature feature branch. That is not what I expect. Because now when I run my tests on master they fail because I have introduced some breaking changes in the submodule on the submodule-feature branch.

Is it not possible to lock the branch of a submodule to the parent repository branch?

EDIT: Based on: How can I specify a branch/tag when adding a Git submodule?

Looks like I can specify a branch for the sub module repository in .gitmodules

[submodule "mysubmodule"]
    path = mysubmodule
    url = https://bla.git
    branch = submodule-feature

And adding the following additional git behavior in jenkins:

enter image description here

and:

enter image description here

it clones/checkout the submodule-feature branch when running a build on the parent myfeature branch.

But that will of course require some manual steps when working locally. But from CI side its very easy to accomplish.

like image 793
u123 Avatar asked Jan 26 '23 15:01

u123


1 Answers

The short answer is just mostly "no". While Jenkins has a checkbox, it may not be a good idea to use it here—whether it is or not, depends on who controls the name-to-ID-mapping. Other CI systems may or may not have similar checkboxes. To see what I'm getting at here, read on.

The philosophy of submodules is that the superproject is in control of its submodules. This part, I think, is neither surprising nor objectionable to anyone. But the key lies in the way the superproject controls each submodule. This part does surprise people, and the reason is fairly simple. It's a basic misunderstanding of Git repositories in general.

People think that what matters in Git repositories are branches, or more precisely, branch names like master and develop. That's simply not true. These branches, for the most part, are virtually irrelevant here. For humans, these branch names serve an enormous, overriding purpose. For Git, they serve a mostly-trivial point that's equally well covered by any other name, such as a tag name, or a remote-tracking name, or refs/stash, or HEAD@{17}.1

In Git, the commit, not the branch name (nor tag name, nor any other name), is the central, essential thing. Commits are Git's raison d'être. Without commits, Git has no function. With commits, Git is useful. Commits are actually identified by their hash IDs, whose true names are those big ugly strings like b5101f929789889c2e536d915698f58d5c5c6b7a. Silly things like readable names, like master or develop, are for the weak, the biologicals ... the humans.

Of course, we, being weak humans, like our names. So we use them in our repositories. But when we have a repository that, like a superproject, is controlling another repository, such as a submodule—well, in this case, there are no humans involved. So Git uses the commit ID to control which commit hash ID is extracted in each submodule.

So this is where the surprise comes in—except, once you understand where Git is coming from, it's not surprising at all. When you let the superproject choose the submodule commit, the superproject chooses the submodule commit by hash ID. Any branch names are irrelevant. The hash ID is precise and is always correct. Branch names are sloppy—they move, on purpose, from commit to commit, over time. One commit hash ID can have zero or more branch names that either point directly to it, or can reach it through the commit graph.2

Each and every commit you make in the superproject records the exact submodule hash ID that the submodule is expected to have checked out. Hence, when you git checkout some commit in the superproject, you should generally immediately have each submodule do its own separate git checkout by the hash ID specified in the superproject.3

Remember that each submodule is a Git repository of its own, so it has its own HEAD, index and work-tree. The index in the submodule records the files that are checked out into the work-tree of the submodule, and the HEAD in each submodule is in detached HEAD mode, recording the hash ID of the currently-checked-out commit. It's the superproject's Git that chooses this hash ID—by storing it in a commit in the superproject—and it's the submodule's Git's responsibility to have this particular commit checked out. Nowhere in this process is there any mention of a branch name. Branch names are irrelevant!


1The in-Git function of names, besides of course providing a crutch for weak humans, is to protect objects from being garbage collected. An object is vulnerable to collection if it is not reachable from some name. Since most commits are mostly chained together, one name tends to protect most of the commits in a repository. See also footnote 2.

2For more about reachability, see Think Like (a) Git.

3This doesn't actually happen automatically by default. You have to use git checkout --recurse-submodules or set submodule.recurse in your configuration. Depending on what you're doing—especially, if you're trying to update the submodules—having it happen automatically is either convenient, or extremely annoying.


Why, then, can you set a branch name in the first place?

As you noted, the .gitmodules file can record a branch name. You can also copy this into .git/config (the .git/config setting overrides the .gitmodules setting, if both are set.) But typically, the submodule isn't on a branch at all; it's put into detached HEAD mode, as described above. So what good is this branch name?

The first, but somewhat unsatisfactory, answer is: It's no good at all. Most operations just don't use it.

The second, more satisfactory answer is: A few special-purpose operations do use it. Specifically, if you are in the process of updating the superproject, and want to make a new superproject commit that records a new submodule hash ID, you need some way to pick out the new submodule commit hash ID. There are multiple ways to go about this, and the name is designed for use in one of those ways.

Suppose, for instance, that the submodule is a public repository (perhaps on GitHub) that you don't control. You just use it. Perhaps twice a year, or perhaps 50000 times a day, someone updates the GitHub repository. The new commit(s) they put on their master or their develop or whatever, break(s) a bunch of stuff you use, but that's not a problem, because your superproject doesn't say "get me their latest master or develop commit", your superproject says "get me commit a123456...", and a123456... is always the same commit, forever, until the heat death of the universe, or we stop using Git, whichever occurs first. But, while breaking a bunch of your own sh— err, software, they've introduced a cool new feature that you must have.

What you would like to do at this point is have your Git, the one that's controlling your submodule too, tell your submodule Git: Go get me their latest master or develop or whatever name I recorded earlier. Since you did record that name, you can direct your Git to direct your submodule to do that using:

git submodule update --remote

(to which you can add some extra flags like --checkout or --rebase or --merge, but I'm not going to get into these details—I'm going to assume for now that you just use their latest directly). Your Git has your submodule Git run git fetch and then updates your submodule repository to their latest commit as directed by your submodule's copy of their branch name. (There are now at least three Gits involved in all this—your superproject, your submodule, and the Git repository on GitHub—so it's a little complicated. They, whoever they are, probably have one or more Git repositories they use to control the GitHub one, but at least you don't have to deal with that. Well, not yet.)

Now that your submodule is updated, you must fix up your own code, both to use the the new feature and to deal with all the breaking changes they made to stuff you were already using. So you do all of that, building and testing your software—all on your local machine: there's no CI involved here, not yet—and get it all working. Now you can git add your changes and git add the name of the submodule. Your superproject's index and work-tree now all match up and you are ready to make a new commit in your superproject.

Note that git add submodule-path merely told your Git to record, in your index, the hash ID of the commit that is currently checked-out in the your submodule Git repository. Once again, the branch name, if any, is irrelevant. It doesn't matter if your submodule repository is on branch master or develop, or has a detached HEAD; all that matters is the raw commit hash ID.

You now run git commit to make a new commit. The hash ID from your index, which controls which commit is going to be considered the "right" commit for the submodule, is the commit hash ID you recorded by running git add submodule-path. In this case, that commit ID got selected by the fact that you ran git submodule update --remote earlier. But the only thing that matters is the hash ID in your index, which goes into the new commit.

Now you can git push this commit, that you have made in your superproject Git repository, to some other system, such as your CI system. It can git checkout this commit, and this commit records the right submodule hash ID.

How can I combine this with a CI system so that the CI system picks the hash ID?

This is, obviously, a lot harder may be harder, depending on whether your CI system offers it as a feature.

Now that you know how this is all constructed, though, you have the tools you need. You must have your CI system update (or obtain) its clone of the superproject. That superproject contains, in its .gitmodules file, the URL and path for any submodules that the CI system must clone as well. It may or may not contain some branch name(s) for those submodules.

The CI system must now direct some Git—the superproject Git or the submodule Git—to have the submodule Git git checkout some commit other than the one already recorded as the correct commit, so that the superproject is no longer using the commit that the CI system checked out. In other words, you're no longer building what you submitted to the CI system. You are having the CI system build a new Frankenstein's-monster out of body parts: the main body from your commit, but a limb taken from some other commit that you didn't specify directly: instead, you're allowing someone else to specify which commit goes there. You gave your CI system a name and told it to ask them, whoever they are, what hash ID that name goes to.

Your CI system can now attempt to build and use this Frankenstein's-monster. If it all works well, your CI system will need to make a new commit, that's a lot like your commit except that it records the hash ID it got from them—whoever they are, again—for the submodule in question. Your CI system probably now also needs permission to push this commit somewhere, unless your CI system is also your main repository source-of-truth.

like image 63
torek Avatar answered Jan 31 '23 20:01

torek