Suppose I have two branches: master and dev. The first one contains a file named 1.txt with content
Hello, world
The second one contains file 1.txt with content
Goodbye, world!!
Where and how git will store different copies of file 1.txt? I mean, where exactly in .git folder?
Git doesn't exactly store files. What Git stores are objects.
Branches don't contain files, either. Branch names, like master or dev, store a commit hash ID.
The key to understanding this is all a bit circular: you only really understand it when you understand it. :-) But to get started, think about Git as storing commit objects and being centered around the concept of a commit.
A commit is one of these Git objects. There are four kinds of objects: commits, trees, blobs, and tags. Trees and blobs are used to build up a commit. Tag objects are for annotated tags, but don't worry about these yet.
So Git is all about storing commits, and commits wind up holding your files for you (through those tree and blob objects). But a commit isn't the files themselves: it's more of a wrapper. What goes into the commit is: your name (as the author), your email address, and the time you made the commit; the hash ID of the commit's parent commit; your commit log message; and the hash ID of the tree object that remembers which files went into the commit.
So you might think that the tree object holds your files—but it doesn't either! Instead, the tree object holds the names of the files, along with the hash IDs of blob objects. It's these blob objects that hold your files.
The name of a commit, or any other Git object, is written as a 40-character hash ID like d35688db19c9ea97e9e2ce751dc7b47aee21636b. You have probably seen them, in git log output for instance, or shortened versions that show up when you run other Git commands.
These hash IDs are impossible for humans to use in any practical way, so Git provides a method of turning a short, meaningful name into a big ugly hash ID. These names come in many forms, but the first one you use is a branch name.
What this means is that if you have two branch names, master and dev, each one actually stores a commit hash ID.
Git uses the hash IDs to find commit objects. Each commit object then stores a tree ID. Git uses that to find the tree object. The tree object contains (along with other stuff) a name, like 1.txt, paired with a blob hash ID. Git uses the blob hash ID to find the blob object, and the blob object stores the complete contents of the file.
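The lookup chain just described can be sketched in a few lines of JavaScript. This is a toy in-memory model, not Git's real storage format: the objects and branches maps, the short made-up IDs (c1, t1, b1, ...), and the readFile helper are all assumptions made for illustration.

```javascript
// Toy model: every object lives in one map, keyed by a made-up short ID.
const objects = {
  c1: { type: 'commit', tree: 't1', parent: null, message: 'initial' },
  t1: { type: 'tree', entries: { '1.txt': 'b1' } },
  b1: { type: 'blob', content: 'Hello, world' },
  c2: { type: 'commit', tree: 't2', parent: 'c1', message: 'say goodbye' },
  t2: { type: 'tree', entries: { '1.txt': 'b2' } },
  b2: { type: 'blob', content: 'Goodbye, world!!' },
};

// Branch names store nothing but a commit ID.
const branches = { master: 'c1', dev: 'c2' };

// Follow branch name -> commit -> tree -> blob to get a file's contents.
function readFile(branchName, fileName) {
  const commit = objects[branches[branchName]];
  const tree = objects[commit.tree];
  const blobId = tree.entries[fileName];
  return objects[blobId].content;
}

console.log(readFile('master', '1.txt')); // Hello, world
console.log(readFile('dev', '1.txt'));    // Goodbye, world!!
```

Both versions of 1.txt sit side by side in the same store; the branch name only decides which commit, and therefore which blob, you reach.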
Where and how git will store different copies of file one? I mean, where exactly in .git folder?
When you run git add 1.txt and then commit it, Git makes a blob to hold whatever is in 1.txt. The new blob has some hash ID. Let's say it starts with 1234567.... Git stores the actual contents in .git/objects/12/34567..., in a compressed form, along with some up-front bits that identify the object type as a blob.
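As a rough sketch of where a blob's ID and path come from, here is how the hash can be reproduced in Node.js. It assumes a repository using SHA-1 object IDs and a loose (unpacked) object; the helper names blobBytes, blobId, and loosePath are made up for this example.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Build the bytes Git hashes for a blob: "blob <size>\0" followed by the content.
function blobBytes(content) {
  const body = Buffer.from(content, 'utf8');
  const header = Buffer.from(`blob ${body.length}\0`, 'utf8');
  return Buffer.concat([header, body]);
}

// The object ID is the SHA-1 of header + content.
function blobId(content) {
  return crypto.createHash('sha1').update(blobBytes(content)).digest('hex');
}

// Loose objects live at objects/<first 2 hex chars>/<remaining 38 hex chars>.
function loosePath(id) {
  return `.git/objects/${id.slice(0, 2)}/${id.slice(2)}`;
}

const id = blobId('hello world\n');
console.log(id);            // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
console.log(loosePath(id)); // .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad

// What ends up on disk is the zlib-deflated header + content.
const onDisk = zlib.deflateSync(blobBytes('hello world\n'));
console.log(onDisk.length, 'compressed bytes');
```

The ID printed above should match what git hash-object reports for a file containing hello world followed by a newline. Note that the ID depends only on the content, never on the file name or the branch.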
If you then change 1.txt and git add and git commit again, you get a new blob, with a new ID. Let's say it starts with fedcba9.... This object goes into .git/objects/fe/dcba9....
In order to store these blobs, of course, Git has to write tree objects and commit objects too. If you're on branch dev, when Git writes out the new commit, Git will change the name dev to store the new commit hash ID.
In order to find the commit that was on dev just before all of this, Git writes the new commit with the previous dev tip commit ID as its parent.
Suppose instead of big ugly hash IDs, we give every commit a single letter, starting from A and counting up. That's a lot easier to draw, though of course we'd run out of letters after just 26 commits. :-)
Let's start with a repository with just one commit:
A <-- master
The branch name, master, stores A so that we know that the commit is named A.
This isn't very interesting, so let's make a new commit B:
A <-B <-- master
Now the name master stores the letter B. The commit itself, the B object, has inside it the ID of commit A.
To make another new commit on master, we assign it a new hash C, write a commit object with the appropriate log message and tree and so on, and make C's parent be B:
A <-B <-C
and then we write C into master:
A <-B <-C <-- master
What this means is that branch names, like master, simply point to the tip commit of the branch. The branch itself is, in a sense, the chain of commits starting from the latest and working backwards.
Note that Git's internal arrows all point backwards. Git runs everything backwards, all the time, starting from the latest.
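To make the backwards direction concrete, here is a small sketch that walks from a branch tip to the first commit. It is a toy model using the single-letter commit names from the drawings; the commits and branches maps and the history function are hypothetical, and each commit is assumed to have at most one parent.

```javascript
// Toy commits: each one records only its parent (null for the first commit).
const commits = {
  A: { parent: null },
  B: { parent: 'A' },
  C: { parent: 'B' },
};
const branches = { master: 'C' };

// Start at the branch tip and follow parent links until they run out.
function history(branchName) {
  const seen = [];
  let id = branches[branchName];
  while (id !== null) {
    seen.push(id);
    id = commits[id].parent;
  }
  return seen;
}

console.log(history('master')); // [ 'C', 'B', 'A' ]
```

Nothing in commit A knows about B or C; the only way to find them is to start from the name master and work backwards.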
We can make this more interesting by creating a new branch dev. Initially, dev points to the same commit as master:
A--B--C <-- dev (HEAD), master
We've added this funny notation, (HEAD), to remember which branch name we're using.
Now let's make a new commit as usual. The new commit gets its author and log message as always, and stores the current commit's hash ID, which is C, as its parent, but now we have to update a branch name to point to D. Which branch name should we update? That's where HEAD comes in: it tells us which one to update!
A--B--C <-- master
       \
        D <-- dev (HEAD)
So now dev identifies commit D, while master still identifies C.
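The role of HEAD when committing can be sketched like this. Again this is a toy model with single-letter commit names, not Git's implementation; the commit function and the head variable are assumptions for illustration.

```javascript
// Toy repository state matching the drawing: A <- B <- C, with two names on C.
const commits = { A: { parent: null }, B: { parent: 'A' }, C: { parent: 'B' } };
const branches = { master: 'C', dev: 'C' };
let head = 'dev'; // HEAD is attached to a branch name, not directly to a commit

// Make a new commit: its parent is the current tip, then the HEAD branch moves.
function commit(newId) {
  commits[newId] = { parent: branches[head] };
  branches[head] = newId;
}

commit('D');
console.log(branches);         // { master: 'C', dev: 'D' }
console.log(commits.D.parent); // C
```

Only the name that HEAD is attached to moves; master is left pointing at C.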
This is the first main secret to understanding Git. Git doesn't store files, it stores commits. The commits form into chains. These chains are the history in the Git repository.
Git uses branch names to remember the latest or tip commits. These tip commits let us find the older commits. If we add a new commit E to master we get:
A--B--C--E <-- master
       \
        D <-- dev
and we can now see, visually, that master and dev join up at commit C.
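Finding that join point can be sketched by walking both chains backwards and taking the first commit they share. This is a simplified illustration that assumes every commit has at most one parent (real Git commits can have several, and git merge-base is the actual command for this); ancestors and joinPoint are hypothetical helper names.

```javascript
// Toy commit graph from the drawing above.
const commits = {
  A: { parent: null },
  B: { parent: 'A' },
  C: { parent: 'B' },
  D: { parent: 'C' }, // tip of dev
  E: { parent: 'C' }, // tip of master
};
const branches = { master: 'E', dev: 'D' };

// List a commit and everything reachable from it, newest first.
function ancestors(id) {
  const chain = [];
  while (id !== null) {
    chain.push(id);
    id = commits[id].parent;
  }
  return chain;
}

// The join point is the first commit on one chain that also appears on the other.
function joinPoint(branchA, branchB) {
  const onA = new Set(ancestors(branches[branchA]));
  const shared = ancestors(branches[branchB]).find((id) => onA.has(id));
  return shared === undefined ? null : shared;
}

console.log(joinPoint('master', 'dev')); // C
```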
Running git checkout <branch-name> tells Git to extract the commit at the tip of the branch, using the commit to find the tree to find the blobs to get all the files. Then, as the last step of git checkout of a branch name, Git attaches HEAD to that branch name, so that it knows which branch name to update when we add new commits.
A branch is a text file that contains the hash of a commit.
It is part of the Git references: a group of names that each point to an object, usually a commit.
Git stores references under the .git/refs folder (unless they have been packed into .git/packed-refs), and branches are stored in the directory .git/refs/heads.
Since a branch is a simple text file, we can create one just by creating a file that contains a commit hash.
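A minimal Node.js sketch of reading and writing such a file is below. It works against a throwaway directory rather than a real repository, assumes the ref is stored as a loose file (it does not look in packed-refs), and the readBranch and writeBranch helpers are hypothetical names. In a real repository, git branch or git update-ref is the safer way to do this.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Read the commit ID stored in <gitDir>/refs/heads/<branch>.
// Assumption: the ref is "loose"; refs listed in packed-refs are not handled.
function readBranch(gitDir, branch) {
  const file = path.join(gitDir, 'refs', 'heads', branch);
  return fs.readFileSync(file, 'utf8').trim();
}

// Create a branch the same way: write a commit ID into a new file.
function writeBranch(gitDir, branch, commitId) {
  const file = path.join(gitDir, 'refs', 'heads', branch);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, commitId + '\n');
}

// Demo against a temporary directory standing in for a .git folder.
const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'fake-git-'));
writeBranch(gitDir, 'dev', 'd35688db19c9ea97e9e2ce751dc7b47aee21636b');
console.log(readBranch(gitDir, 'dev')); // d35688db19c9ea97e9e2ce751dc7b47aee21636b
```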
Torek has an excellent answer that I'm not going to try and replicate... but if it's still confusing to you, then let me try to demonstrate how it works with JavaScript. I'm going to simplify things a bit, so this isn't an exact implementation of Git in JS, but it's close enough to understand some of the fundamentals.
A file is made up of two distinct parts: the actual contents of the file; and the metadata about that file (its name and mode). Let's define the contents and store them so that we can reference them later:
let allTheThings = {}; // our stand-in for the .git/objects store
allTheThings['06f19763'] = "blob " + "Hello, world";
The keys here are the SHA1 hashes of the values. This is a really important concept going forward... everything in git is a SHA1 hash of something. You can generate these hashes yourself using any SHA1 tool you want (I used an online tool).
I truncated the hash value to the first 8 characters for brevity. When working in git, you can truncate as much as you want so long as git is still able to uniquely identify an object. Usually 8 characters is enough (the odds of two objects having the same first 8 characters are really, really small), so that's what you'll see in most examples and even in much of the documentation.
Cool... so now we've got the contents. But we want the other half of the file now... its name. To do that, we need to create a tree object that basically replicates a folder/directory.
allTheThings['5e91b67a'] = "tree " + "100644 blob 06f19763 file1.txt";
This tree object says that the file contents referenced by 06f19763 (or "Hello, world") are named file1.txt and have mode 100644 (the mode is based on Unix file modes; this one means file1.txt is a regular, non-executable file).
In addition to files, trees can contain other trees, which is how we can create directories of arbitrary depth.
Each commit contains a reference to a tree, representing the root directory of the repo. In our example, file1.txt is located in the root and is the only file in the repo. So let's create a commit:
allTheThings['a9d13be8'] =
"commit\n" +
"tree 5e91b67a\n" +
"author JD <email> 1508777071\n" +
"committer JD <email> 1508777071\n" +
"\n" +
"Commit message";
The commit points to our tree, and includes some additional info like author of the commit and a commit message.
A branch is pretty much just a convenient name for a commit. When you update a branch, you're just creating a new commit then resetting the branch to point to it.
All the things we've created so far are stored in the allTheThings object, so they're all stored together. We can tell what everything is based on the prefixes ("blob", "tree" and "commit"). Every entry is keyed off the hash of the contents, which is virtually guaranteed to be unique. Whenever we change the contents of a file, a file name, a commit message, etc, we change the hash, but the original object is still there and can still be referenced by other objects (trees, commits, etc).
For example, if we update the file we end up with new hash ids up the entire chain:
allTheThings['3e103e35'] = "blob " + "Goodbye, world!!";
allTheThings['05abc8ab'] = "tree " + "100644 blob 3e103e35 file1.txt";
allTheThings['a5944bfa'] =
"commit\n" +
"tree 05abc8ab\n" +
"author JD <email> 1508777071\n" +
"committer JD <email> 1508777071\n" +
"\n" +
"Commit message";
Notice how, even though the file name and commit message/author/etc did not change, the change to the contents of file1 caused a chain reaction the entire way up to the commit:
06f19763 => 3e103e35 (the contents changed...)
5e91b67a => 05abc8ab (so the content reference in the tree changed)
a9d13be8 => a5944bfa (so the tree reference in the commit changed)
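The chain reaction can be reproduced with a small content-addressed store. This is a sketch in the same simplified spirit as the rest of this answer: it hashes plain strings, whereas real Git hashes a "<type> <size>" header plus a binary body, so the keys it prints will not match the IDs shown above. The store, put, and snapshot names are made up for this example.

```javascript
const crypto = require('crypto');

// Simplified store: key = first 8 hex chars of the SHA-1 of the stored string.
const store = {};
function put(value) {
  const key = crypto.createHash('sha1').update(value).digest('hex').slice(0, 8);
  store[key] = value;
  return key;
}

// Store blob -> tree -> commit for one version of file1.txt.
function snapshot(contents) {
  const blob = put('blob ' + contents);
  const tree = put('tree 100644 blob ' + blob + ' file1.txt');
  const commit = put('commit\ntree ' + tree + '\n\nCommit message');
  return { blob, tree, commit };
}

const before = snapshot('Hello, world');
const after = snapshot('Goodbye, world!!');
console.log(before, after); // all three keys differ between the two snapshots

// Storing identical content again adds nothing: same content, same key.
const countBefore = Object.keys(store).length;
snapshot('Hello, world');
console.log(Object.keys(store).length === countBefore); // true
```

Because each key is derived from the value, and each value embeds the key of the object below it, a change at the bottom necessarily changes every key above it.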
All six objects exist in our allTheThings object, happily living right next to each other:
allTheThings = {
  // keys are quoted: hex strings like 06f19763 are not valid bare property names
  "06f19763": "blob Hello, world",
  "3e103e35": "blob Goodbye, world!!",
  "5e91b67a": "tree 100644 blob 06f19763 file1.txt",
  "05abc8ab": "tree 100644 blob 3e103e35 file1.txt",
  "a9d13be8": "commit\ntree 5e91b67a\nauthor JD <email> 1508777071\ncommitter JD <email> 1508777071\n\nCommit message",
  "a5944bfa": "commit\ntree 05abc8ab\nauthor JD <email> 1508777071\ncommitter JD <email> 1508777071\n\nCommit message",
};
Finally, your master branch points to a9d13be8, while your dev branch points to a5944bfa.
In real git, these objects are stored in the .git directory as individual (compressed) files (.git/objects/12/34567... as Torek said), but it's the same concept.
Because a git repo can contain so many objects, the leading two characters of the hash are used to subdivide files into directories, to ensure that the maximum file count in a directory isn't exceeded (especially on older systems). It's tempting to think that these prefixes have more meaning than that, such as object type, but they don't.
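To see where the object type really lives, here is a rough sketch of decoding one of those compressed files in Node.js. It builds a stand-in file in memory instead of reading a real repository, assumes the loose-object layout (zlib-deflated "<type> <size>", a NUL byte, then the body), and parseLooseObject is a hypothetical helper name. In practice, git cat-file -p <id> does this for you.

```javascript
const zlib = require('zlib');

// Parse the raw bytes of a loose object file: inflate, then split at the first NUL.
function parseLooseObject(fileBytes) {
  const raw = zlib.inflateSync(fileBytes);
  const nul = raw.indexOf(0);
  const [type, size] = raw.subarray(0, nul).toString('utf8').split(' ');
  return { type, size: Number(size), body: raw.subarray(nul + 1) };
}

// Build a stand-in for a file under .git/objects/ so the demo needs no repository.
const fakeFile = zlib.deflateSync(Buffer.from('blob 12\0Hello, world', 'utf8'));

const obj = parseLooseObject(fakeFile);
console.log(obj.type, obj.size, obj.body.toString('utf8')); // blob 12 Hello, world
```

The type comes from the header inside the file, which is why the two-character directory name carries no meaning beyond being the start of the hash.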
And that's pretty much it. Blobs (file contents), trees, commits, and a few other things, are all considered Git Objects and are lumped together inside the objects directory. You can use plumbing commands to work directly with these objects and extract them for use, but it's almost always much easier to use the many porcelain commands to work with them indirectly.