
Are there multiple files stored in the `stage` after `git add`?

Tags:

git

As far as I know, after each git add "filename", Git creates a corresponding object and stores a pointer to that object in the index.

Here is an example:

 touch a.txt
 git add a.txt            # version 1
 echo "change" >> a.txt   # modify a.txt
 git add a.txt            # version 2

Are there two files stored in the index, corresponding to version 1 and version 2? Or does only the newest one, version 2, exist?

Will version 1 be deleted from the index, so that only version 2 is stored?

asked Mar 07 '23 14:03 by scottxiao


2 Answers

Only the latest version added will be accessible from the index.

You can use git ls-files --stage to inspect what files are currently in the index.

What happens when you git add something is that a new blob object is created. This object is then referenced from the index. So if you add a change, find the blob hash for the resulting file (for instance with git ls-files --stage) and keep hold of that hash, you will still be able to find the blob after the index is updated to record a different blob for the file, until it eventually gets garbage collected.

An example:

Create file foo with contents hello blob and add it to the index.

$ echo "hello blob" > foo && git add foo

Find the blob reference from the index and show the blob contents.

$ git ls-files --stage | grep foo
100644 038f48ad0beaffbea71d186a05084b79e3870cbf 0   foo
$ git cat-file -p 038f48ad0beaffbea71d186a05084b79e3870cbf
hello blob

Replace the contents of foo with good bye blob and add it to the index.

$ echo "good bye blob" > foo && git add foo

Find the blob reference from the index and show the blob contents.

$ git ls-files --stage | grep foo
100644 d026852604f5986457e8867e2cb27b4cddb24e6f 0   foo
$ git cat-file -p d026852604f5986457e8867e2cb27b4cddb24e6f
good bye blob

The original blob still exists until it gets garbage collected.

$ git cat-file -p 038f48ad0beaffbea71d186a05084b79e3870cbf
hello blob

Documentation/technical/index-format.txt gives more information about the internal format of the index.

answered Mar 16 '23 07:03 by jsageryd


The answer to your question as asked is actually a bit tricky. It's mostly no, but there are certain conditions—namely, conflicted merges—under which the index / staging-area can hold multiple versions for a single file. Moreover, for some time, it's possible to retrieve any added version, but there's a hitch.

About the index / staging-area / cache

This thing called the staging area has three names, to reflect its importance and/or because it has multiple roles and/or because the first name for it, the index, is kind of a poor name. :-) The three names are the index, the staging area, and the cache.

What the index does is to cache information (hence the cache name) about the work-tree. As a result, it acts as an index for looking things up quickly about the work-tree (hence the index name). The purpose of all of this is so that you can stage things for committing (hence the staging area name).

The main problem with all of this is that it is difficult to see the index and hence difficult to form a mental model about it. Note that the work-tree can contain files that aren't in the index: these are, by definition, untracked files. This means that listing the work-tree contents doesn't tell you what's in the index.

The main purpose of the index is to hold whatever will go into the next commit. That is, git commit actually makes the new commit's content—the snapshot—using the index / staging-area. Whatever is in the index now, that is what is in the new commit.
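As a rough illustration of that point, here is a minimal sketch in a throwaway repository (the directory, file name, identity and commit message are all made up for the example). It uses git write-tree to turn the current index into a tree object, then checks that the commit made right afterwards records that same tree:

```shell
# Throwaway repository; every name below is illustrative only.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false

echo "first" > a.txt
git add a.txt

# Tree object describing what the index holds right now.
index_tree=$(git write-tree)

git commit -q -m "snapshot of the index"

# Tree object recorded in the commit that was just made.
commit_tree=$(git rev-parse 'HEAD^{tree}')

echo "index tree:  $index_tree"
echo "commit tree: $commit_tree"
```

If the two IDs printed are identical, the commit's snapshot came straight from the index.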

Viewing the index directly

To see what's in the index as a list of file names, use git ls-files. To see this in more detail, use --stage:

$ git ls-files --stage
[snip]
100644 d76e13c8524003fcc5c55d706c1177f66520b9d4 0       builtin/checkout.c
100644 fad533a0a7382f10ecf48a738c955734ad5c0d96 0       builtin/clean.c
100644 101c27a593f4c64a735410f18bfcb46489728696 0       builtin/clone.c
[snip]
100755 067e9e7f4440a4a4a6f0dda6875fd3f840e694de 0       GIT-VERSION-GEN
100644 60e515eaf7432e77d8db8837b7bb95e105ca2126 0       INSTALL
100644 d38b1b92bdb2893eb4505667375563f2d6d4086b 0       LGPL-2.1
100644 a1d8775adb4b38a0340cd7d04184915f0ee65d28 0       Makefile
100644 f17af66a97c8097ab91f074478c4a5cb90425725 0       README.md
[snip]

This shows how the index holds a copy (a version, as it were) of each of the files in the work-tree. The first entry gives the file's mode. This is always[1] 100644 or 100755 for ordinary files: 100644 means it's not executable, while 100755 means it is executable. The second, which looks a lot like a commit hash ID, is the blob hash ID of the staged content for that file. The third number is called the stage number: it's usually zero, and I'll go into more detail in a moment. After these three numbers, git ls-files --stage prints a tab character (ASCII code 9) followed by the file's full name, including any directory part and always using forward slashes.
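To make the field layout concrete, here is a small sketch that splits one such line apart in plain POSIX shell. The sample line is copied from the listing above; the variable names and the kind labels are my own, not anything Git prints:

```shell
# Sample line in the form "<mode> <hash> <stage><TAB><path>".
tab=$(printf '\t')
line="100755 067e9e7f4440a4a4a6f0dda6875fd3f840e694de 0${tab}GIT-VERSION-GEN"

# Everything after the first tab is the path; everything before it is metadata.
path=${line#*"$tab"}
meta=${line%%"$tab"*}

# The metadata consists of three space-separated fields.
IFS=' ' read -r mode hash stage <<EOF
$meta
EOF

# Translate the mode into a human-readable label (labels are illustrative).
case $mode in
  100755) kind="executable file" ;;
  100644) kind="regular file" ;;
  120000) kind="symbolic link" ;;
  *)      kind="other" ;;
esac

echo "$path: $kind, blob $hash, stage $stage"
```

Splitting on the tab first matters because a path may itself contain spaces, while the three leading fields never do.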


[1] In the distant past, Git stored more than just "executable" or "not executable". That turned out to be a mistake, so now the only two normally-allowed values are 100644 and 100755, which also explains why the numbers have this many bits in them. There are other file types, such as 120000 for a symbolic link. The types mostly match the Linux/Unix inode bits.


The index version usually matches at least one of the HEAD or work-tree versions

You can see what's in your current commit using:

$ git ls-tree -r HEAD
[snip]
100644 blob d76e13c8524003fcc5c55d706c1177f66520b9d4    builtin/checkout.c
100644 blob fad533a0a7382f10ecf48a738c955734ad5c0d96    builtin/clean.c
100644 blob 101c27a593f4c64a735410f18bfcb46489728696    builtin/clone.c
[snip]

Compare this to the first section of git ls-files --stage output I retained above. Except for the insertion of the word blob and the deletion of the stage number zero, these match exactly.

What's going on here is that when I checked out this commit from the Git repository for Git, Git extracted the files from the current commit:

$ git rev-parse master
0afbf6caa5b16dcfa3074982e5b48e27d452dbbb

into the index and work-tree. In the index, they're just stored as blob hashes: the same hashes as in the commit itself. In fact, the index and the commit share the internal Git blob object holding each file's content. We can view the content directly using the hash IDs; for instance, README.md has hash ID f17af66a97c8097ab91f074478c4a5cb90425725, so:

$ git cat-file -p f17af66a97c8097ab91f074478c4a5cb90425725
Git - fast, scalable, distributed revision control system
=========================================================

Git is a fast, scalable, distributed revision control system with an
unusually rich command set that provides both high-level operations
and full access to internals.
[snip]

Note that these blob objects, stored by their hash ID in the Git repository object database, are in a special, read-only, Git-specific form. That's one reason we have to use git cat-file to read them, rather than just looking at .git/objects/.... Another reason is that they can get packed and then there's no individual object to examine.

Running git add creates or re-uses a blob

When you run git add path, Git copies the work-tree version—which is in its ordinary form and is generally read/write, so that you can change it—into the internal form, as a blob object. If the file's content exactly matches some existing blob object, Git will just re-use that existing blob object. Otherwise, Git will create a new read-only blob object to hold that content.

In any case, the result of this re-use existing or create new process is that there is now a blob object in the repository's object database, and this blob object has a hash ID. Git can now stuff the new hash ID into the index. For instance, if I modify the README.md file:[2]

$ ed README.md 
3001
1i
hello world
.
w
3013
q
$ head -3 README.md
hello world
Git - fast, scalable, distributed revision control system
=========================================================

and then git add this, I get a new entry in the index:

$ git add README.md
$ git ls-files --stage -- README.md
100644 331117c13c79da78d15ad24c2111c15eeef56ddf 0       README.md

and the blob hash matches that from git hash-object:

$ git hash-object README.md 
331117c13c79da78d15ad24c2111c15eeef56ddf

and now I have a different version of README.md in the index.
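The re-use rule mentioned earlier can be checked with a small sketch, assuming a scratch repository (the file names one.txt and two.txt are invented for the example). Two paths with identical content should end up pointing at one and the same blob, and git hash-object should predict its ID:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Two different paths with byte-for-byte identical content.
echo "same content" > one.txt
echo "same content" > two.txt
git add one.txt two.txt

# Blob IDs recorded in the index (":path" names the stage-0 index entry).
blob_one=$(git rev-parse :one.txt)
blob_two=$(git rev-parse :two.txt)

# Hash computed from the work-tree file, without writing a new object.
predicted=$(git hash-object one.txt)

echo "one.txt -> $blob_one"
echo "two.txt -> $blob_two"
echo "hash-object predicts $predicted"
```

The blob ID depends only on the content, not on the file name, which is why both index entries can share one object.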

If I change README.md again and add it again, the staged hash ID will change. Let's change the added line, predict the hash, add the file, and see:

$ head -3 README.md
jello world
Git - fast, scalable, distributed revision control system
=========================================================
$ git hash-object README.md
2be1367cce5788ee15d4de758fc95599721dd1f4
$ git add README.md
$ git ls-files --stage -- README.md
100644 2be1367cce5788ee15d4de758fc95599721dd1f4 0       README.md

What happened to the intermediate one we added? Well, at least for the moment, it still exists:

$ git cat-file -p 331117c13c79da78d15ad24c2111c15eeef56ddf | head -3
hello world
Git - fast, scalable, distributed revision control system
=========================================================

(In fact, due to the rules for pruning unreferenced objects, it will stick around for 14 days by default.)
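A sketch of how one might spot such a left-over blob, assuming a scratch repository with no commits (note.txt and its contents are made up). After the second git add, git fsck --unreachable should list the first blob, and git cat-file can still read it:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

echo "first version" > note.txt
git add note.txt
old_blob=$(git rev-parse :note.txt)

# Stage a second version; the index entry now points at a different blob.
echo "second version, longer" > note.txt
git add note.txt
new_blob=$(git rev-parse :note.txt)

# The first blob is no longer referenced by the index or by any commit,
# but it is still present in the object database.
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null)
old_content=$(git cat-file -p "$old_blob")

echo "$unreachable"
echo "old blob still holds: $old_content"
```

Once the prune expiry has passed, a later git gc is free to delete the old blob, after which the cat-file call would fail.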

If I want the README.md file to go back to the version in the HEAD commit, I can use git checkout to extract that into the index, and then on into the work-tree:

$ git checkout HEAD -- README.md
$ git ls-files --stage -- README.md
100644 f17af66a97c8097ab91f074478c4a5cb90425725 0       README.md

Note that it's possible, but unusual, to git add a file, then change it some more. When you git add-ed it, you got a snapshot of what was in the work-tree at the time; when you changed it more, you made another version; and now the index version may differ from both the HEAD version and the work-tree version.
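Here is a minimal sketch of that three-way difference, assuming a scratch repository (f.txt, the identity and the commit message are invented for illustration). It compares the blob ID in HEAD, the blob ID in the index, and the hash of the work-tree file:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false

echo "v1" > f.txt
git add f.txt
git commit -q -m "v1"

echo "v2 staged" > f.txt
git add f.txt                      # the index now holds the second version
echo "v3 not staged yet" > f.txt   # the work-tree moves on without git add

head_blob=$(git rev-parse HEAD:f.txt)     # version in the current commit
index_blob=$(git rev-parse :f.txt)        # version in the index
worktree_blob=$(git hash-object f.txt)    # hash of the work-tree content

echo "HEAD:      $head_blob"
echo "index:     $index_blob"
echo "work-tree: $worktree_blob"
```

Running git commit at this point would record the second version, since that is what the index holds.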

Using git add -p does this same sort of thing: after applying each diff hunk, it git adds the patched version that it's building up. As a result, you can get many objects that must be cleaned-up 14 days later. For instance, if you have 14 diff hunks and add 13 of them, you get 12 intermediate versions in the object database, plus the 13th that you actually commit. This is normal and pretty harmless unless you are using git add -p on a twelve terabyte file. :-)


[2] I use the ed editor here since it's reproducible in plain text. The first number, 3001, is the file's initial size in bytes. The 1i command inserts text before line 1, and . on a line by itself ends the insertion. The w command writes the file, which reports the size in bytes, and q exits the editor. You can see that I added twelve bytes: "hello" and "world" are 5 each, the space is one, and the final newline is the twelfth.


Viewing the index indirectly

When you run git status, you will often see changes staged for commit and changes not staged for commit. What is really going on here is that Git is running two comparisons—two git diffs, more or less. One compares the current or HEAD commit to the index. Whatever is different here is staged for commit. Whatever is the same here, it says nothing about.

But then it goes on to compare the index contents to the work-tree contents. Whatever is different here is not staged for commit. Whatever is the same here, it says nothing about.

Viewing the index indirectly like this is generally more useful than viewing it directly. We mostly want to know: What, in the next commit I am about to make, is already changed from the current commit? In other words, what's the difference between HEAD and the index? And: What else could I git add to make the index more different? In other words, what's the difference between the index and the work-tree?

Since these are the most useful to know, these are what git status shows.
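The two comparisons can also be run by hand. The sketch below assumes a scratch repository with two invented files, a.txt and b.txt; git diff --cached compares HEAD to the index, and plain git diff compares the index to the work-tree:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false

echo "a1" > a.txt
echo "b1" > b.txt
git add a.txt b.txt
git commit -q -m "initial"

echo "a2 staged change" > a.txt
git add a.txt                        # changed and staged
echo "b2 unstaged change" > b.txt    # changed but not staged

# HEAD vs index: what git status reports as staged for commit.
staged=$(git diff --cached --name-only)

# Index vs work-tree: what git status reports as not staged for commit.
unstaged=$(git diff --name-only)

echo "staged:   $staged"
echo "unstaged: $unstaged"
```

A file that was staged and then edited again would appear in both lists.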

Nonzero stage numbers: how Git merges files

All this tricky stuff with the staging area is somewhat useful, especially for git add -p, but in the end you could easily get away without it. Mercurial, which is very similar to Git in many ways, has no staging area: you just run hg commit and it commits whatever is in the work-tree.[3] The work-tree is your staging-area: you just edit and go.[4]

During git merge, though—or actually anything that invokes Git's merge machinery, the part of merge that I like to refer to as merge as a verb—the index takes on a much bigger role, and it's here that the stage numbers actually matter.

To do any ordinary three-way merge, Git needs three inputs: a merge base commit, the current (HEAD) commit, and the commit you wish to merge. Each commit typically has one version of each file. Git can now write, into the index / staging-area, all three versions, using the stage number that's ordinarily just zero:

$ git merge br1
Auto-merging file
CONFLICT (content): Merge conflict in file
Automatic merge failed; fix conflicts and then commit the result.
$ cat file
base file
<<<<<<< HEAD
different stuff
||||||| merged common ancestors
=======
new stuff
>>>>>>> br1
$ git ls-files --stage
100644 56a073d1bdc0307f357a407b50bee4324bb55873 0       README
100644 4bbeee7f6db44d17c38abd3031fcb84a97192459 1       file
100644 66c11225b90f34558b8ace8c5eb203ec47b55c38 2       file
100644 7502e0fdd9ef370b5eee377ceeee4b1605d88336 3       file

(I have merge.conflictStyle set to diff3, so I see all three versions in the work-tree file named file here.)

Note that the index now has three entries for file file. The stage-1 entry, with hash ID 4bbeee..., is the merge base version. The stage-2 entry, hash ID 66c11..., is the HEAD or --ours version. The stage-3 entry, 7502e..., is the version from the branch br1.

The README is the same in every version—all three hashes matched—so Git was able to leave it alone, and it's at stage zero.

During the merge operation—which happens only if all three versions differ—if Git is able to combine the three versions on its own, it will do so and put the merge result in as stage zero. If not—if there are any merge conflicts—Git will leave all three versions in as the three stages, and write its best guess at the merge, plus the conflicts, into the work-tree.

What makes this easy when two or three of the hash IDs match is that only the following cases can arise:

  • All three hash IDs match: we didn't touch the file and they didn't touch the file, so keep the base / ours / theirs version of the file. They are all the same!
  • Two of the three IDs match. Then one of the following must be true:
    • The base version matches ours: only they changed it; take their file.
    • The base version matches theirs: only we changed it; take our file.
    • The base version is different, but ours matches theirs: take ours or theirs, whichever is easier, since we both made the same changes.

Hence Git only has to run the full three-way merge algorithm on files where all three versions differ. Even then, as I noted above, Git may be able to resolve the merge on its own. It's only if it can't that it puts all three versions into the index and stops with a merge conflict.
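The case analysis above can be written down as a tiny decision function. This is only a sketch of the reasoning, not Git's actual code: resolve_by_hash is a hypothetical helper, and the short strings stand in for real blob hash IDs.

```shell
# Hypothetical helper (not a Git command): decide what to do with one file
# by comparing the three blob hash IDs alone.
resolve_by_hash() {
  base=$1
  ours=$2
  theirs=$3
  if [ "$ours" = "$theirs" ]; then
    echo "take ours"          # untouched on both sides, or both made the same change
  elif [ "$base" = "$ours" ]; then
    echo "take theirs"        # only they changed it
  elif [ "$base" = "$theirs" ]; then
    echo "take ours"          # only we changed it
  else
    echo "three-way merge"    # all three differ: a content merge is needed
  fi
}

all_same=$(resolve_by_hash aaa aaa aaa)
only_theirs=$(resolve_by_hash aaa aaa bbb)
only_ours=$(resolve_by_hash aaa bbb aaa)
same_change=$(resolve_by_hash aaa bbb bbb)
all_differ=$(resolve_by_hash aaa bbb ccc)

echo "$all_same / $only_theirs / $only_ours / $same_change / $all_differ"
```

Only the last case needs the file contents at all; the others are settled by comparing hashes.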

Once Git has stopped with a merge conflict, note that git add takes whatever's in the work-tree, writes it into a blob object, then writes the ID into slot zero and removes the three higher-stage entries. That's what marks the conflict as resolved.
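The whole cycle can be reproduced with a sketch like the one below, assuming a scratch repository (the identity, commit messages and the resolved content are made up; file and br1 mirror the names used above). It records the stage numbers during the conflict, reads each staged version with git show, and then checks what is left after git add:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config commit.gpgsign false

echo "base" > file
git add file
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)

git checkout -q -b br1
echo "new stuff" > file
git add file
git commit -q -m "theirs"

git checkout -q "$main_branch"
echo "different stuff" > file
git add file
git commit -q -m "ours"

# The merge stops with a conflict, so a non-zero exit status is expected here.
git merge br1 >/dev/null 2>&1 || true

# One index entry per stage: 1 = merge base, 2 = ours, 3 = theirs.
stages_during=$(git ls-files --stage -- file | awk '{print $3}' | tr '\n' ' ')
base_version=$(git show :1:file)
ours_version=$(git show :2:file)
theirs_version=$(git show :3:file)

# Resolve by writing some final content, then stage it.
echo "resolved" > file
git add file
stages_after=$(git ls-files --stage -- file | awk '{print $3}' | tr '\n' ' ')

echo "during conflict: $stages_during"
echo "after git add:   $stages_after"
```

The :1:file, :2:file and :3:file names are handy during a real conflict too, for instance to inspect the merge base version on its own.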


[3] This isn't quite true: for new files, you must hg add the file. The principle should be clear enough, though. Mercurial has a secret place it keeps much of what's in Git's index, which Mercurial calls its dirstate, but you do not have to know about it. The dirstate does not hold multiple versions the way Git's index does, though.

[4] If you use git commit -a, Git effectively (but not actually; there's special sneakiness under the hood) runs a git add -u to update the index just before doing the commit. This makes Git act a lot like Mercurial: you only have to git add new files. There's some asymmetry here with removed files, though. Mercurial used to remove them automatically, just as git add -u will, but that turned out to be a mistake. I recommend not using git commit -a as it will probably bite you eventually.


Summary

This is what you should know about the index / staging area / cache:

  • It has three names. They all mean the same thing; they just emphasize different roles, or were typed in by different people.

  • It's the place you build up the next commit to make. Running git commit will use whatever is in the index right then to make the commit (ignoring special weirdness from git commit -a or git commit --only <file> or git commit --include <file>).

  • It generally starts out matching the current commit (but it doesn't have to: see also Checkout another branch when there are uncommitted changes on the current branch).

  • Running git add copies files from the work-tree to the index. All commits, including the current commit, are read-only, so nothing ever copies files into an existing commit: that's completely impossible.

    The files get stored into the object database, so in case of disaster, you can use git fsck --lost-found to find what Git calls unreachable blob objects. These will have the contents you git added, but the file names will be lost.

  • Other commands (not covered much or at all above) copy files from the index to the work-tree, or from commits, through the index, to the work-tree. The main command for this is git checkout.

    It's somewhat difficult to copy a file out of a commit without first writing it into the index. Thus, the commands that copy from commits—including git reset—first overwrite the index version.

  • Conflicted merges use higher-stage number index slots. Thus, merging occurs via the index, too. Except for merging itself, writing into the index collapses away these higher-stage versions, leaving only the normal slot-zero entry.

There are, of course, maintenance / plumbing commands that can set up higher stage index entries, and tools for fussing with the index. We saw git ls-files above for viewing it, and git update-index can change it. But you don't need either of these in everyday Git usage.

answered Mar 16 '23 06:03 by torek