
How does a Mercurial repository grow over time?

Tags:

dvcs

mercurial

Let's say I create a repository, add x files to it and commit. Say the size is a Mb after the initial commit.

  • Is there any way to estimate how large the repository is going to be in one year's time?

  • If the number of lines of code has increased by 10%, will the repository have grown accordingly?

  • How does number of commits, branches, tags etc. factor into the repository size?

  • Will 10000 commits in the same year make the repository grow (noticeably) more than, say, 1000 commits?

  • Maybe my question is wrongly phrased?

MdaG asked Oct 29 '10 14:10


3 Answers

Changes to a Mercurial repository are stored as either a complete file or as a compressed delta against the previous version:

https://www.mercurial-scm.org/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F

Mercurial makes the decision about whether to store a complete file versus a delta based on the amount of changes made.

This means that it's not just adding lines of code that will increase the total size of a repository, but also:

  1. The number of changes made to existing code.
  2. The number of changes made to each file per commit.
  3. The number of files that are added and subsequently deleted.

Mercurial retains all deleted files. You could add a 1GB file to your repository and then delete it; the number of lines hasn't increased, but because the file remains in the repository, the repository will be considerably larger.
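
As a rough illustration of why a delta against the previous version is usually much cheaper than a complete file, here is a toy Python sketch. It is not Mercurial's actual storage format: a unified diff from difflib stands in for the real delta, zlib stands in for the compression step, and the file contents are made up.

```python
import difflib
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8")))


# Hypothetical file: 1000 short lines, then one line edited.
old_lines = [f"value_{i} = {i * 7919 % 1000}\n" for i in range(1000)]
new_lines = list(old_lines)
new_lines[500] = "value_500 = changed\n"

# Stand-in for a delta: a unified diff with no context lines.
delta = "".join(difflib.unified_diff(old_lines, new_lines, n=0))

full_size = compressed_size("".join(new_lines))
delta_size = compressed_size(delta)
print(f"full snapshot: {full_size} bytes, delta: {delta_size} bytes")
```

A one-line edit costs only a small delta, whereas rewriting most of a file produces a delta close to the size of the file itself, which is why the amount of change matters more than the line count.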

To answer your questions in turn:

  • I imagine it's feasible to roughly estimate the size of a repository after x months, assuming that you maintain a steady overall rate of change to the repository (i.e. you add/remove/alter files at the same rate, changing roughly the same number of lines per commit).

  • Increasing the number of lines of code by 10% doesn't tell us how many lines were deleted/altered, so an increase in lines of code won't necessarily correspond to the same increase in repo size.

  • Tags don't affect Mercurial repo size by more than a handful of bytes. Nor do branches, until you start working on them, at which point they add the same overhead as working on the tip. Number of commits should be reasonably proportional to the repo size, assuming the same rate of change occurs.

  • Committing 10x as often probably won't increase the repository size, as it is the rate of change that is the main influence on repo size, not the number of commits.
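
To put numbers on the "steady rate of change" assumption, one option is to measure the on-disk size of the repository's metadata directory at two points in time and extrapolate linearly. The sketch below is only that, a sketch: the function names `dir_size_bytes` and `project_size` are made up for illustration, and the projection is only as good as the assumption that the growth rate stays constant.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (e.g. the .hg directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def project_size(size_then, size_now, days_between, days_ahead):
    """Linear projection of repository size, assuming a constant growth rate."""
    rate_per_day = (size_now - size_then) / days_between
    return size_now + rate_per_day * days_ahead
```

For example, with measurements taken 30 days apart one could call `project_size(size_a_month_ago, dir_size_bytes(".hg"), 30, 365)` to get a very rough figure for a year from now.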

Ant answered Oct 21 '22 01:10


Directly estimating the size in a year is obviously impossible, unless you have some idea of the number of commits and the final size of the work tree.

That said, git is pretty disk-space efficient. It absolutely never stores more than one copy of a given version of a file (this is internally represented as a blob), and older blobs are delta-compressed into packs. This means that it is very efficient at storing plain text, and very inefficient with large binary files. If your project is predominantly plain text, you almost certainly have nothing to worry about.

Branches and tags have essentially no effect on size. Sure, a branch's reflog could get up to a few KB, but that's nothing to worry about. Lightweight tags are pretty much just a stored SHA1, and annotated tags just add a tiny bit of metadata to that.

As for lines of code and number of commits, it's hard to say exactly. Generally the commits are a much bigger factor than the lines of code; you can have many, many versions of files all adding up (even represented as deltas), but the actual content only has to be stored once. This is backed up by the fact that work trees tend to be much smaller than the .git directory. For example, my clone of git.git has a 17MB work tree and a 39MB .git directory. Other projects I examined had similar ratios.

More commits of equal size would certainly make the repository grow more, but taking 1000 commits and splitting them up into 10000 (encompassing the same changes) wouldn't make the repository much bigger. The commit objects themselves are small; it's the differences in the files that take space. You might see an initial spike in size, as commits are only periodically delta-compressed, but once git gc --auto gets triggered, those commits will get compressed back down.

The best generalization I can make is that a repository's .git directory will tend to grow at a rate proportional to the amount of delta per time, which in general should be proportional to work tree size and the rate at which people are modifying the project. This is of course so general as to be completely unhelpful, but there you are.

If you want to estimate, I'd just take some data over the first month or so, and try and fit a curve.
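
A minimal sketch of that curve-fitting idea, assuming a straight line is a good enough model: an ordinary least-squares fit in plain Python. The sample data below is invented for illustration; in practice the sizes would come from measuring the .git (or .hg) directory at regular intervals.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical weekly samples: (day number, repository size in MB).
days = [0, 7, 14, 21, 28]
sizes_mb = [10.0, 10.9, 12.1, 12.8, 14.2]

slope, intercept = fit_line(days, sizes_mb)
estimate_mb = intercept + slope * 365
print(f"growth: {slope:.3f} MB/day, estimate after one year: {estimate_mb:.1f} MB")
```

If the project's activity is expected to ramp up or tail off, a straight line will mislead, and a different model (or simply refitting as more data arrives) would be more appropriate.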

Cascabel answered Oct 21 '22 01:10


Take a look at the GitBenchmarks page on the Git wiki, specifically the sections "Repository size benchmarks" and "Other benchmarks and references" (taking into account when each benchmark was made and which versions it uses), and in particular the entry at the end of the page:

  • DVCS Round-up: One System to Rule Them All? -- Part 3 by Robert Fendt on Linux Developer Network, from 27-01-2009, contains results of two synthetic benchmarks testing how a system acts under stress (number of commits in the repository, or number of files committed).

    The test system was a VM running Ubuntu 8.10, and the software versions used were SVK 2.0.2 (last is 2.2.3), darcs 2.1.0 (last is 2.4.4), monotone 0.42 (last is 0.48), Bazaar 1.10 (last is 2.2.1), Mercurial 1.1.2 (last is 1.6.4), and Git 1.6.1 (last is 1.7.3).

Jakub Narębski answered Oct 21 '22 00:10