
Explanation of GitHub forks and how GitHub stores files

I am just wondering what happens when a project is forked on GitHub.

For example, when I fork a project, does GitHub make a copy of all of that code on its servers, or does it just create a link to it?

A related question: since git hashes all files, if you add the same file again, git does not need to store the file contents a second time because the hash is already in the system, correct?

Does GitHub work like this too? If I happen to upload the exact same piece of code as another user, does GitHub essentially just create a link to that file, since it would have the same hash, or does it save all of its contents again separately?

Any enlightenment would be great, thanks!

Jonovono asked Aug 15 '12 18:08



2 Answers

According to https://enterprise.github.com/releases/2.2.0/notes GitHub Enterprise (and I assume GitHub) somehow shares objects between forks to reduce disk space usage:

This release changes the way GitHub Enterprise stores repositories, which reduces disk usage by sharing Git objects between forks and improves caching performance when reading repository data.

There are also more details about how they do it at https://githubengineering.com/counting-objects.
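To make the idea of "sharing Git objects between forks" concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: SharedObjectStore, ToyRepo and the repository names are hypothetical and invented for illustration, and this is not how GitHub's storage is actually implemented. It only shows why content-addressed objects make sharing cheap.

```python
import hashlib


class SharedObjectStore:
    """Hypothetical pool of objects shared by a repository and its forks."""

    def __init__(self):
        self.objects = {}  # object id -> raw content

    def put(self, content: bytes) -> str:
        # Same ID scheme git uses for blobs: SHA-1 over "blob <size>\0" + content.
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        oid = hashlib.sha1(header + content).hexdigest()
        # Content that is already present is not stored a second time.
        self.objects.setdefault(oid, content)
        return oid


class ToyRepo:
    """Hypothetical repository: its own file table, objects kept in the pool."""

    def __init__(self, name: str, store: SharedObjectStore):
        self.name = name
        self.store = store
        self.files = {}  # path -> object id

    def add(self, path: str, content: bytes) -> None:
        self.files[path] = self.store.put(content)

    def fork(self, new_name: str) -> "ToyRepo":
        # The fork gets an independent file table but reuses the same pool.
        child = ToyRepo(new_name, self.store)
        child.files = dict(self.files)
        return child


store = SharedObjectStore()
upstream = ToyRepo("upstream/project", store)
upstream.add("README", b"hello\n")

fork = upstream.fork("someuser/project")
fork.add("README", b"hello\n")      # same bytes: no new object is stored
fork.add("NOTES", b"my changes\n")  # new bytes: exactly one new object

print(len(store.objects))  # prints 2
```

In this model the fork behaves like a separate repository (changes to its file table never touch the upstream one), yet identical content is held only once in the shared pool.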

bbodenmiller answered Oct 04 '22 19:10


github.com has exactly the same semantics as git, with a web-based GUI wrapped around it.

Storage: "Git stores each revision of a file as a unique blob object"
So each distinct revision of a file is stored as its own blob, and git uses the SHA-1 hash of the content to tell whether two files are the same or different.
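As a small illustration of the hashing, the sketch below computes the ID git would give a blob, assuming a repository that uses the default SHA-1 object format. The function name git_blob_id is made up for this example; the result should match what `git hash-object` reports for the same bytes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two files with identical bytes get the same ID, whatever they are named,
# so within one repository the content only needs to be stored once.
file_a = b"test content\n"
file_b = b"test content\n"

print(git_blob_id(file_a) == git_blob_id(file_b))  # prints True
print(git_blob_id(file_a))  # prints d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Changing even one byte of the content produces a different ID, and therefore a new blob.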

As for GitHub, a fork is essentially a clone. This means that a new fork is a new area of storage on their servers, with a reference to its origin. From git's point of view there is no need to set up links between the two, because git by nature can track remotes. Each fork knows its upstream.

When you say "if I happen to upload the exact same piece of code as another user", the term "upload" is a bit vague in the git sense. If you are working in the same repository and git lets you commit the file, that means its content differed from what was already there, so git recorded a new revision. If you mean working in a clone or fork of another repo, the situation is the same, and no links would be made on the filesystem to the other repo.

I can't claim to have any intimate knowledge of what optimizations GitHub might be making under the hood on their internal systems. They could well be doing custom operations to save disk space. But anything they do would be transparent to you and would not matter much, since the service should always behave according to expected git semantics.

A developer at GitHub wrote a blog post about the git workflow they use internally. While it doesn't address your question about how the service itself is run, I think this quote from the conclusion is pretty informative:

Git itself is fairly complex to understand, making the workflow that you use with it more complex than necessary is simply adding more mental overhead to everybody’s day. I would always advocate using the simplest possible system that will work for your team and doing so until it doesn’t work anymore and then adding complexity only as absolutely needed.

What I take away from that is that they acknowledge how complex git is by itself, so most likely they take the lightest touch possible in wrapping it to provide the service, and let git do what it does best natively.

jdi answered Oct 04 '22 17:10