I'm looking for a way to set up git repositories that include subsets of files from a larger repository, and inherit the history from that main repository. My primary motivation is to be able to share subsets of the code via GitHub.
I currently manage my research-related (mostly Matlab) code via a single git repository. The code itself is loosely organized into a handful of folders, with code dependencies that often cross over folders. I don't want to upload a remote copy of the whole repository, because it includes a lot of mixed projects that no one else would want in its entirety.
My mental picture of this involves a separate repository for each project that tracks only the relevant files for that project, but inherits all the commits from the main repository. Ideally, I'd like to be able to tag versions within these sub-repositories separately from the main one, but that's not a necessity. I've looked into git submodules, subtrees, and gitslave, but all of these seem to assume that the subprojects are isolated collections of files, while in my case many subprojects share files with other subprojects. I also attempted to create a project-specific branch, git rm-ing irrelevant files, but that fell apart as soon as I needed to merge changes from the main branch into the project branch (a mess of conflicts due to changes in project-deleted files).
The stats:
I currently share code by simply copying the relevant files to a new folder periodically for each project. But this means that the new copies have no commit history attached. Is there a more robust method of sharing these various subsets of code, and keeping them up to date with changes I make?
You are looking for git submodules:
It often happens that while working on one project, you need to use another project from within it. Perhaps it’s a library that a third party developed or that you’re developing separately and using in multiple parent projects. A common issue arises in these scenarios: you want to be able to treat the two projects as separate yet still be able to use one from within the other.
The TL;DR on submodules is that they are repos contained within other repos.
The only thing the parent repo knows about the child is the SHA of the last commit that the child told it about, so each repo is managed independently of the other, while the recorded reference lets you compose them together.
Here's a well-written blog post from GitHub on the topic.
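To make the "parent only stores a SHA" point concrete, here is a minimal sketch that builds two throwaway repositories and inspects the parent's tree. The names child, parent and lib, and the git_demo helper, are made up for illustration; the protocol.file.allow setting is only there because recent Git versions refuse local-path submodule clones by default.

```shell
#!/bin/sh
# Sketch: show that a parent repo records a submodule only as a commit SHA.
# All names (child, parent, lib, git_demo) are hypothetical; everything runs
# in a throwaway temporary directory.
set -e
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Demo-only identity and file-protocol settings, passed per command.
git_demo() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# A tiny child repository with one commit.
git init -q child
echo 'function y = helper(x); y = x; end' > child/helper.m
git_demo -C child add helper.m
git_demo -C child commit -q -m 'Add helper'

# A parent repository that embeds the child as a submodule under lib/.
git init -q parent
git_demo -C parent submodule --quiet add "$demo_dir/child" lib
git_demo -C parent commit -q -m 'Add child as submodule'

# The parent tree holds a "commit" entry (mode 160000) naming the child SHA.
child_sha="$(git -C child rev-parse HEAD)"
parent_entry="$(git -C parent ls-tree HEAD lib)"
echo "child HEAD:   $child_sha"
echo "parent entry: $parent_entry"
```

Note that the whole child repository is embedded under one directory, which is why this approach assumes the subproject's files are isolated from the rest.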
As I understand your question, neither git subtree nor git submodules is what you are looking for, because your subprojects share files with each other.
One way to extract the history of just a subset of the files into a dedicated branch (which you can then push into a dedicated repository) is using git filter-branch:
# regex to match the files included in this subproject, used below
file_list_regex='^subproject1/|^shared_file1$|^lib/shared_lib2$'
git checkout -b subproject1 # create new branch from current HEAD
git filter-branch --prune-empty \
--index-filter "git ls-files --cached | grep -v -E '$file_list_regex' | xargs -r git rm --cached" \
HEAD
This will
1. create a new branch subproject1 based on the current HEAD (git checkout -b subproject1)
2. rewrite the history of that branch (git filter-branch [...] HEAD)
3. remove all files (xargs -r git rm --cached) that are not part of the subproject (git ls-files --cached | grep -v -E '$file_list_regex')
4. drop commits that become empty as a result (--prune-empty)
5. work on the index only, without checking out each commit (--index-filter / --cached).
This is a one-time operation, though, and as I understand your question you want to continuously update the extracted subproject repositories/branches with new commits.
The good news is you could simply repeat this command, since git filter-branch will always produce the same commits/history for your subproject branches, given that you don't manually alter them or rewrite your master branch.
The drawback is that this would filter-branch the complete history each time, and for each subproject, again and again.
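Because every run rewrites the full history, it can be worth checking the include pattern before starting. The following is a minimal sketch of that check: it uses the same grep -E / grep -v -E filters as the command above, but feeds them a made-up list of paths (the list_paths function stands in for git ls-files --cached and is not part of Git).

```shell
#!/bin/sh
# Sketch: preview which paths the pattern keeps and which it would remove.
# The sample paths are invented for illustration.
file_list_regex='^subproject1/|^shared_file1$|^lib/shared_lib2$'

# Stand-in for `git ls-files --cached`.
list_paths() {
    printf '%s\n' \
        'subproject1/main.m' \
        'subproject2/other.m' \
        'shared_file1' \
        'shared_file10' \
        'lib/shared_lib2' \
        'lib/unrelated'
}

# Same filters as in the --index-filter: matches are kept, the rest is removed.
kept="$(list_paths | grep -E "$file_list_regex")"
removed="$(list_paths | grep -v -E "$file_list_regex")"

echo "kept:";    echo "$kept"
echo "removed:"; echo "$removed"
```

The trailing $ anchors matter here: without them, shared_file10 would be kept along with shared_file1.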
Given that you only want to add the last 5 commits of the master branch to the tip of your existing subproject1 branch, you could adapt the commands like this:
# get the full commit ids for the commits we consider
# to be equivalent in master and subproject1 branch
common_base_commit="$(git rev-parse master~5)"
subproject_tip="$(git rev-parse subproject1)"
# checkout a detached HEAD so we don't change the master branch
git checkout --detach master
git filter-branch --prune-empty \
--index-filter "git ls-files --cached | grep -v -E '$file_list_regex' | xargs -r git rm --cached" \
--parent-filter "sed s/${common_base_commit}/${subproject_tip}/g" \
"${common_base_commit}..HEAD"
# force reset subproject1 branch to current HEAD
git branch -f subproject1
Explanation:
1. Only the new commits are rewritten (git filter-branch [...] ${common_base_commit}..HEAD), that is, everything after master~5, which we consider to be the equivalent commit to subproject1's current tip.
2. The parent of the first rewritten commit is changed from master~5 to subproject1 (--parent-filter "sed s/${common_base_commit}/${subproject_tip}/g"), effectively rebasing the 5 rewritten commits on top of subproject1.
3. The subproject1 branch is reset to include the new commits on top of it.
Further optimization/automation:
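1. Optimize the way the files to include ($file_list_regex), or actually to exclude (git ls-files --cached | grep -v -E '$file_list_regex'), are determined for a given subproject.
2. Make the file list depend on the commit being rewritten ($GIT_COMMIT), or check in the list to the repository itself, in case the files to include per subproject may change over time.
3. Wrap the commands in a script so that an update becomes a single call, e.g. git update-project subproject1.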
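As a starting point for the checked-in list idea, here is a minimal sketch that turns a per-subproject list file into the include pattern. The list format (one path per line, a trailing "/" marking a directory) and the build_file_regex name are assumptions for illustration, not an existing Git feature.

```shell
#!/bin/sh
# Sketch: build the include pattern from a per-subproject list file.
# Assumed format: one path per line; a trailing "/" means "this directory".
build_file_regex() {
    # Skip blank lines, escape ERE metacharacters, anchor the start of each
    # entry, anchor the end of plain files, then join everything with "|".
    grep -v '^$' "$1" |
        sed -e 's/[][\.|$(){}?+*^]/\\&/g' \
            -e 's|^|^|' \
            -e '/\/$/!s/$/$/' |
        paste -s -d '|' -
}

# Usage with a hypothetical list file for subproject1.
list_file="$(mktemp)"
printf '%s\n' 'subproject1/' 'shared_file1' 'lib/shared_lib2' > "$list_file"
file_list_regex="$(build_file_regex "$list_file")"
echo "$file_list_regex"
```

The resulting value can be used wherever $file_list_regex appears in the commands above. Escaping matters for names such as util.m, where an unescaped dot would match any character.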