
Are concurrent operations possible with Git repositories?

Tags:

git

There are two scenarios that I'm interested in.

  • The repository is shared and two users want to push changes to it at the same time
  • I want to schedule a nightly or weekly "gc" using a cron job. It runs and someone wants to push or clone during the operation.

Is there a risk of corruption in either of these scenarios?

dromodel Avatar asked Oct 23 '12 21:10



2 Answers

Git allows concurrent operations by using pessimistic concurrency control.

When necessary, git creates some special files to act as locks.

In particular, every time an operation modifies the index, git creates a file called index.lock in the .git directory to lock that shared resource. Git creates other lock files as needed: for example, a .keep file is created during git index-pack operations.
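
You can see the index lock in action by creating index.lock yourself. This is only a sketch of the observable behavior, using a throwaway repository path, not how git's locking is actually implemented:

```shell
# Fake a stale index.lock in a throwaway repo and watch git refuse to
# touch the index until the lock file is gone.
rm -rf /tmp/lockdemo
git init -q /tmp/lockdemo
cd /tmp/lockdemo
echo hi > f
touch .git/index.lock      # pretend another process holds the lock
git add f || echo "add refused while index.lock exists"
rm .git/index.lock         # "release" the lock
git add f                  # now succeeds
```

If a git process dies without cleaning up, you may have to remove a stale index.lock by hand, which is exactly the failure mode the first git add above simulates.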

In general, you shouldn't worry about concurrent operations with git: it is carefully designed to support them.

Someone might tell you that you shouldn't worry about performing gc with a cron job, since git itself triggers gc from time to time. Even though this is true, the man page itself recommends:

Users are encouraged to run this task on a regular basis 
within each repository to maintain good disk space utilization
and good operating performance.

Hence, I think it's not a bad idea to schedule a cron job to run git's garbage collection. I just wonder whether it is a premature optimisation or whether you are trying to solve a real, measured issue. I personally have never had problems that required me to run gc manually, but I wouldn't be surprised if your case is quite different.
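
If you do go that route, a minimal sweep could look like the following. The GIT_ROOT path and the cron schedule shown in the comment are assumptions for illustration, not anything git prescribes:

```shell
# Hypothetical nightly sweep: run gc in every bare repo under $GIT_ROOT.
# In cron this might be invoked as:
#   0 3 * * *  /usr/local/bin/gc-sweep.sh
GIT_ROOT=${GIT_ROOT:-/srv/git}
for repo in "$GIT_ROOT"/*.git; do
    [ -d "$repo" ] || continue          # skip if the glob matched nothing
    git -C "$repo" gc --quiet
done
```

Running it with --quiet keeps cron mail down to actual errors.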

Arialdo Martini Avatar answered Oct 23 '22 13:10



In general, "git gc" may delete objects that another concurrent process is using but hasn't created a reference to.
Git 2.12 (Q1 2017) has more on this.

See commit f1350d0 (15 Nov 2016) by Matt McCutchen (mattmccutchen).
(Merged by Junio C Hamano -- gitster -- in commit 979b82f, 10 Jan 2017)

And see Jeff King's comment:

Modern versions of git do two things to help with this:

  • any object which is referenced by a "recent" object (one modified within the last 2 weeks) is also considered recent. So if you create a new commit object that points to a tree, that tree is protected even before you create a reference to the commit

  • when an object write is optimized out because we already have the object, git will update the mtime on the file (loose object or packfile) to freshen it
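
The freshening in the second point can be observed directly. This sketch assumes a Linux system (GNU stat and touch) and a throwaway repo; the behavior it probes is the one described in the answer, on a modern git:

```shell
# Observe mtime "freshening": re-writing an object git already has
# updates the mtime of the existing loose object file.
rm -rf /tmp/freshdemo
git init -q /tmp/freshdemo
cd /tmp/freshdemo
oid=$(echo hello | git hash-object -w --stdin)          # write a loose blob
obj=.git/objects/$(echo "$oid" | cut -c1-2)/$(echo "$oid" | cut -c3-)
touch -d '3 weeks ago' "$obj"                           # pretend it's old
echo hello | git hash-object -w --stdin > /dev/null     # write optimized out...
stat -c %Y "$obj"                                       # ...but mtime is fresh again
```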

This isn't perfect, though. You can decide to reference an existing object just as it is being deleted. And the pruning process itself is not atomic (and it's tricky to make it so, just because of what we're promised by the filesystem).

If you have long-running data (like, a temporary index file that might literally sit around for days or weeks) I think that is a potential problem. And the solution is probably to use refs in some way to point to your objects.
If you're worried about a short-term operation where somebody happens to run git-gc concurrently, I agree it's a possible problem, but I suspect something you can ignore in practice.

For a busy multi-user server, I recommend turning off auto-gc entirely, and repacking manually with "-k" to be on the safe side.
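
That advice could translate into something like the following. The throwaway bare repo stands in for a real server-side path, and -k is git repack's short form of --keep-unreachable:

```shell
# Sketch of the server-side advice above, on a throwaway bare repo
# (a real path like /srv/git/project.git would go here instead).
rm -rf /tmp/server.git
git init -q --bare /tmp/server.git
cd /tmp/server.git
git config gc.auto 0     # never let incoming pushes trigger an implicit gc
git repack -a -d -k      # -k (--keep-unreachable) keeps unreachable objects in the pack
```

With -k, repacking never deletes objects, so a concurrent pusher can't lose an object it was about to reference.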

This is why the git gc man page now includes:

On the other hand, when 'git gc' runs concurrently with another process, there is a risk of it deleting an object that the other process is using but hasn't created a reference to. This may just cause the other process to fail or may corrupt the repository if the other process later adds a reference to the deleted object.

Git has two features that significantly mitigate this problem:

  • Any object with modification time newer than the --prune date is kept, along with everything reachable from it.

  • Most operations that add an object to the database update the modification time of the object if it is already present so that #1 applies.

However, these features fall short of a complete solution, so users who run commands concurrently have to live with some risk of corruption (which seems to be low in practice) unless they turn off automatic garbage collection with 'git config gc.auto 0'.
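
The first mitigation is easy to see in a throwaway repo: a freshly written object survives gc's default --prune=2.weeks.ago grace period even when nothing references it:

```shell
# A fresh, unreachable object survives gc's default two-week prune window.
rm -rf /tmp/prunedemo
git init -q /tmp/prunedemo
cd /tmp/prunedemo
oid=$(echo data | git hash-object -w --stdin)   # loose blob, nothing points to it
git gc --quiet                                  # default: --prune=2.weeks.ago
git cat-file -e "$oid" && echo "object survived"
# git gc --prune=now would delete it -- dangerous with concurrent writers
```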


Note on that last sentence, which mentions "unless they turn off automatic garbage collection": Git 2.22 (Q2 2019) amends the gc documentation.

See commit 0044f77, commit daecbf2, commit 7384504, commit 22d4e3b, commit 080a448, commit 54d56f5, commit d257e0f, commit b6a8d09 (07 Apr 2019), and commit fc559fb, commit cf9cd77, commit b11e856 (22 Mar 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit ac70c53, 25 Apr 2019)

gc docs: remove incorrect reference to gc.auto=0

The chance of a repository being corrupted due to a "gc" has nothing to do with whether or not that "gc" was invoked via "gc --auto", but whether there's other concurrent operations happening.

This is already noted earlier in the paragraph, so there's no reason to suggest this here. The user can infer from the rest of the documentation that "gc" will run automatically unless gc.auto=0 is set, and we shouldn't confuse the issue by implying that "gc --auto" is somehow more prone to produce corruption than a normal "gc".

Well, it is in the sense that a blocking "gc" would stop you from doing anything else in that particular terminal window, but users are likely to have another window, or to be worried about how concurrent "gc" on a server might cause corruption.

VonC Avatar answered Oct 23 '22 15:10
