git combining two files into one with history preserved

Tags: git, file, merge

Imagine that you have two files in a git repository, say A.txt and B.txt.

Is it possible to concatenate the two files into a third one, A+B.txt, remove the original A.txt and B.txt, and commit it all so that the history is still preserved?

That is, if I ran git log --follow A+B.txt, would I see that the content originated from the A.txt and B.txt files?

I've tried separating the files into two different branches and then merging them into a new file (while removing the old ones), but to no avail.

Peter Uhnak asked Oct 06 '17 17:10


2 Answers

The long answer is 'yes'!

Full credit to Raymond Chen's article Combining two files into one while preserving line history:

Imagine you had two files: fruits & veggies

[Image: git blame output for both fruits and veggies]

The naïve way of combining the files would be to do it in a single commit, but you'll lose line history on one of the files (or both).

You could tweak the git blame algorithms with options like -M and -C to get it to try harder, but in practice, you don’t often have control over those options (e.g., the git blame may be performed on a server).
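To make the naïve approach concrete, here is a minimal sketch that builds a throwaway repository, combines the two files in a single commit, and then runs git blame with and without the extra options. The temporary directory, the file contents, and the author names (Alice, Bob, Carol, Demo) are all made up for illustration. How much history the second blame recovers depends on the file contents and on blame's heuristics, so no particular output is claimed.

```shell
set -e
# Throwaway repository; everything below is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'apple\nbanana\ncherry\n' > fruits
git add fruits
git commit -q -m "add fruits" --author="Alice <alice@example.com>"
printf 'carrot\nleek\nonion\n' > veggies
git add veggies
git commit -q -m "add veggies" --author="Bob <bob@example.com>"

# Naive approach: concatenate, delete the originals, commit once.
cat fruits veggies > produce
git rm -q fruits veggies
git add produce
git commit -q -m "combine fruits and veggies" --author="Carol <carol@example.com>"

# Default blame on the combined file.
git blame produce

# Ask blame to try harder: -M looks for lines moved within a file,
# -C for lines moved or copied from other files (repeating -C widens the search).
git blame -M -C -C produce
```

Compare the two blame listings to see which lines, if any, are still attributed to the combining commit.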

The trick is to use a merge with two forked branches:

  • In one branch, we rename veggies to produce.
  • In the other branch, we rename fruits to produce.
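
# Branch 1: rename veggies to produce
git checkout -b rename-veggies
git mv veggies produce
git commit -m "rename veggies to produce"
# Branch 2: back on the original branch, rename fruits to produce
git checkout -
git mv fruits produce
git commit -m "rename fruits to produce"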

Then merge the first into the second:

git merge -m "combine fruits and veggies" rename-veggies

This will generate a merge conflict, and that's expected. Now take the contents of each branch's produce file and combine them into one. Here's a simple concatenation (but resolve the merge conflict however you please):

cat "produce~HEAD" "produce~rename-veggies" >produce
git add produce
git merge --continue

The resulting produce file was created by a merge, so git knows to look in both parents of the merge to learn what happened.

[Image: git blame output for produce]

And that’s where it sees that each parent contributed half of the file, and it also sees that the files in each branch were themselves created via renames of other files, so it can chase the history back into both of the original files.

Each line should be correctly attributed to the person who introduced it in the original file, whether it’s fruits or veggies. People investigating the produce file get a more accurate history of who last touched each line of the file.
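The whole procedure can be tried end to end in a scratch repository. The sketch below is one way to do that, under a few assumptions: the temporary directory, file contents, and author names (Alice, Bob, Demo) are invented, and each side of the conflict is read straight from its branch with git show rather than from produce~HEAD and produce~rename-veggies, because which files a conflicted merge leaves in the working tree may differ between Git versions.

```shell
set -e
# Throwaway repository; names and contents are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'apple\nbanana\ncherry\n' > fruits
git add fruits
git commit -q -m "add fruits" --author="Alice <alice@example.com>"
printf 'carrot\nleek\nonion\n' > veggies
git add veggies
git commit -q -m "add veggies" --author="Bob <bob@example.com>"

# Two pure renames on two branches.
git checkout -q -b rename-veggies
git mv veggies produce
git commit -q -m "rename veggies to produce"
git checkout -q -
git mv fruits produce
git commit -q -m "rename fruits to produce"

# The merge is expected to stop with a conflict, so do not abort on its exit status.
git merge -m "combine fruits and veggies" rename-veggies || true

# Resolve by concatenating each branch's committed version of produce.
{ git show HEAD:produce; git show rename-veggies:produce; } > produce
rm -f "produce~HEAD" "produce~rename-veggies"   # leftovers, if this Git version wrote them
git add produce
GIT_EDITOR=true git merge --continue            # keep the prepared merge message

# Each line should trace back to whoever added it to fruits or veggies.
git blame produce
authors=$(git blame --line-porcelain produce | sed -n 's/^author //p' | sort -u)
echo "$authors"
```

If the merge trick worked, the list of blamed authors contains the people who wrote the original files, not the person who performed the renames and the merge.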

For best results, your rename commit should be a pure rename. Resist the temptation to edit the file’s contents at the same time you rename it. A pure rename ensures that git’s rename detection will find the match. If you edit the file in the same commit as the rename, then whether the rename is detected as such will depend on git’s “similar files” heuristic.
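One way to check that a rename commit really is pure is to diff it against its parent with rename detection and look at the similarity score. This is a small sketch in a throwaway repository (the directory and file contents are made up); a pure rename is reported with status R100.

```shell
set -e
# Throwaway repository for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'carrot\nleek\nonion\n' > veggies
git add veggies
git commit -q -m "add veggies"

# Rename only; no content edits in the same commit.
git mv veggies produce
git commit -q -m "rename veggies to produce"

# -M enables rename detection; R100 means "renamed, 100% similar".
rename_status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$rename_status"
```

A score below 100 means the contents changed in the same commit, and detection then depends on the similarity heuristic.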

Check out the full article for a step-by-step breakdown and more explanation.


Originally, I had thought this might be a use case for git merge-file doing something like this:

: >empty-base    # an empty file to act as the common ancestor
git merge-file -p --union fruits empty-base veggies >produce
rm empty-base
git rm fruits veggies
git add produce
git commit -m "combine fruits and veggies"

However, all this does is apply the merge diffing algorithm to two different files. Once committed, the result is identical to editing the file by hand and committing the changes manually.

KyleMit answered Oct 21 '22 01:10


The short answer is "no" (or perhaps even Mu). (But for a way to get useful synthesized line history for a combined file via git blame, see KyleMit's answer.)

History, in Git, is the set of commits. There is no such thing as "file history": you either have a commit, or you don't, and that commit has one or more parents, or it doesn't. This means that "file history" as a thing doesn't exist—and yet, git log --follow exists. This is self-contradictory: How can git log --follow produce a file history, if file history doesn't exist?

The answer is that git log --follow cheats. It doesn't really find file history. It looks through history and constructs a sub-history by changing the (single) name of the file it is looking for. It looks at each commit, one at a time, and runs a (sped-up, limited) git diff --find-renames of that commit against its parent.[1] If the diff says that file X.txt in the parent was renamed to A.txt in the child, and you're running git log --follow A.txt, the code in git log now starts looking for X.txt.
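The effect of that name switch can be seen in a small scratch repository. The sketch below uses made-up contents and commit messages; it shows one commit-against-parent diff with rename detection, then compares git log with and without --follow.

```shell
set -e
# Throwaway repository for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'one\ntwo\nthree\n' > X.txt
git add X.txt
git commit -q -m "add X.txt"
git mv X.txt A.txt
git commit -q -m "rename X.txt to A.txt"

# One step of the walk: diff a commit against its parent with rename detection.
git diff --find-renames --name-status HEAD~1 HEAD

# Without --follow, the path filter stops at the commit that introduced A.txt.
plain=$(git log --format=%s -- A.txt)
# With --follow, git log switches to the old name X.txt and keeps going.
followed=$(git log --follow --format=%s -- A.txt)
printf 'plain:\n%s\nfollowed:\n%s\n' "$plain" "$followed"
```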

Since there's no code to start looking for more than one file at a time, you can't get this particular cheat to accommodate your desired situation, which is to go from looking for one particular file to looking for more than one. (There are actually two problems here. One is that, due to the rather limited internal implementation,[2] git log --follow can only look at one file at a time. The other is that rename detection does not include "combine detection". There is a form of "split detection", in which Git does copy-finding, enabled with --find-copies and --find-copies-harder. The latter is very compute-intensive, and both work in the wrong direction here, although copy-finding could be made to do the right thing simply by reversing the order of the diff.)
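The one-file-at-a-time limit is easy to observe: git log refuses --follow when it is given more than one path. A minimal sketch in a throwaway repository (file names and contents are illustrative):

```shell
set -e
# Throwaway repository for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'a\n' > A.txt
printf 'b\n' > B.txt
git add A.txt B.txt
git commit -q -m "add A.txt and B.txt"

# Two paths with --follow: git log exits with an error instead of following both.
follow_status=0
git log --follow --oneline -- A.txt B.txt || follow_status=$?
echo "exit status with two paths: $follow_status"

# A single path is accepted.
git log --follow --oneline -- A.txt
```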


[1] As this implies, --follow doesn't look at merge diffs at all, at least by default. See also `git log --follow --graph` skips commits.

[2] a.k.a. "cheesy hack"

torek answered Oct 21 '22 02:10