 

How do I refactor my code to a new file and preserve git history?

I want to extract part of a large file into a new file while preserving the git history, so that I can still run git blame on the new file and see the changes as they were before the refactoring.

valk asked Oct 18 '22 22:10



1 Answer

In Git, the history is the commits. There is no file history. This is unlike most other version control systems: VCSes that track "file identity" need you to tell them that new file path/to/new.ext is derived from path/to/existing.ext, so that they can associate the new file's history with the old file's history. Similarly, they need you to tell them about file renames, although some, like ClearCase, can auto-detect renames by simply acting as the file system for the work-tree. Git does not need any of this because it does not work that way.[1]

Instead, in Git, when you compare one commit (call it a) to another (b), Git attempts to discover, dynamically and at compare time, whether some file a/path/to/name is "the same" as another file b/some/other/path/to/anothername. The degree of comparison, and the algorithm for deciding whether these are the same file or different files, are up to the Git command. The git diff command starts by looking at the actual path names: if they are the same, the files are the same;[2] otherwise they are probably different. The "probably" part is where rename detection (-M) comes in, if it is enabled. A regular git diff also has -C and --find-copies-harder to enable "file-copied-from" detection. Using -C twice (or --find-copies-harder) sets things up to look for new files being copied from any file in the a commit. This is considered too expensive to do automatically; normally, only files that are otherwise considered "modified" are treated as source-of-copy candidates.
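To see the difference between one -C and --find-copies-harder in git diff, here is a minimal sketch that builds a throwaway repository. The file names orig.txt and copy.txt and the "Demo" identity are made up for illustration, and it assumes git and mktemp are available.

```shell
#!/bin/sh
# Sketch only: orig.txt, copy.txt and the "Demo" identity are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit 1: a single file.
printf 'line one\nline two\nline three\n' > orig.txt
git add orig.txt
git commit -q -m "add orig.txt"

# Commit 2: add an exact copy, leaving orig.txt itself unmodified.
cp orig.txt copy.txt
git add copy.txt
git commit -q -m "add copy.txt"

# One -C only treats files modified in the same commit as copy sources,
# so the unmodified orig.txt is not a candidate here.
one_c=$(git diff -C --name-status HEAD~1 HEAD)

# --find-copies-harder (or -C -C) considers every file in the older commit.
harder=$(git diff --find-copies-harder --name-status HEAD~1 HEAD)

echo "one -C: $one_c"
echo "harder: $harder"
```

Because the copy source is left unmodified, the first comparison should report copy.txt as a plain addition (A), while the second should report it as a copy of orig.txt (C100).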

The git blame command is somewhat different (its a and b commits are simply each commit and its parent, chosen automatically), but it still has a -C option. Its -C works a bit differently: one -C looks for lines copied from files modified between commits a and b. Using -C twice also looks for such lines in any file in commit a when b is the commit that creates the file being blamed, and with three -C flags it will "find copies harder still": it looks for copies from other files in any commit, not only the one that creates the file.

Hence, for most purposes you can just use one -C on your git blame. You should use -C -C if the copied code comes from a non-modified file. Use three -Cs if you believe some code was deleted many revs ago, then resurrected, and you want to find the original source. Note that git blame's -C option turns on git blame's -M option, which detects code moved within a file (and is therefore quite different from git diff's -M option, which detects renamed files). In git blame, file rename detection, a la git log --follow,[3] is always enabled.
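For the refactoring described in the question, the effect of a single -C can be checked with a small experiment. This is a sketch under assumptions: big.txt, new.txt, the helper_step lines and the "Demo" identity are all invented for the demo, and it relies on git and mktemp being installed.

```shell
#!/bin/sh
# Sketch only: big.txt, new.txt and the "Demo" identity are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit 1: the original large file.
cat > big.txt <<'EOF'
alpha: keep this line in the original big file
helper_step_one: parse the incoming request body
helper_step_two: validate every required field
helper_step_three: normalise the field values
helper_step_four: write the record to storage
helper_step_five: return the stored record id
omega: keep this line in the original big file too
EOF
git add big.txt
git commit -q -m "add big.txt"
first=$(git rev-parse HEAD)

# Commit 2: the refactoring. Move the helper lines into new.txt, so that
# big.txt is modified in the same commit that creates new.txt.
grep 'helper_step' big.txt > new.txt
grep -v 'helper_step' big.txt > big.tmp
mv big.tmp big.txt
git add big.txt new.txt
git commit -q -m "extract helpers into new.txt"

# Count the lines of new.txt attributed to the first commit.
# --line-porcelain starts each line's record with the full commit hash.
plain=$(git blame --line-porcelain new.txt | grep -c "^$first " || true)
copied=$(git blame -C --line-porcelain new.txt | grep -c "^$first " || true)

echo "plain blame, lines from first commit: $plain"
echo "blame -C, lines from first commit:    $copied"
```

Plain git blame should attribute every line of new.txt to the refactoring commit, while git blame -C should trace all five moved lines back to the commit that first added them to big.txt. Note that git blame only follows copied chunks above a minimum size (by default 40 alphanumeric characters for -C), so very short extracted snippets may still be blamed on the refactoring commit.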


[1] This is a nice advantage for Git over other VCSes, because Git can detect cases that humans forgot, and can also detect renames when comparing "far apart" revisions. It's a terrible disadvantage for Git, because it must detect cases even when humans would not have forgotten, and hence sometimes misses renames. It's a big advantage for Git, because future, smarter algorithms can use the existing data in better ways. In short, there are arguments for why it's better and why it's worse, but ultimately it's just different.

[2] For git diff, you can conditionally break apart these automatically-paired "same name means same file" pairings using its -B option. This option is not available in git blame, but git blame does not need it, because it does not do this kind of pairing.

[3] The code enabled by --follow in git log is a horrible hack that basically only works for the one case required by git blame. Do not try to use --follow with reverse-order git log.

torek answered Oct 21 '22 08:10
