
Get Git branch size

Tags:

git

I'm trying to track the size of a project I'm working on. Is there an easy way to get the repository size on disk for different branches?

I tried

git count-objects -v

But it gives the same repository size for each branch.

avellable asked Sep 14 '15 05:09


3 Answers

With Git 2.31 (Q1 2021), the "git rev-list" command learned the --disk-usage option.

The documentation has many examples. For the size of a branch relative to the current one (objects reachable from the branch but not from HEAD), the command is now:

git rev-list --disk-usage --objects HEAD..<branch_name>

For all branches:

# Report the disk size of each branch, not including objects used by the
# current branch. This can find outliers that are contributing to a
# bloated repository size (e.g., because somebody accidentally committed
# large build artifacts).

git for-each-ref --format='%(refname)' |
while read -r branch
do
    size=$(git rev-list --disk-usage --objects HEAD.."$branch")
    echo "$size $branch"
done |
sort -n

See commit a1db097, commit 669b458 (17 Feb 2021), and commit 16950f8, commit 3803a3a (09 Feb 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 6fe12b5, 25 Feb 2021)

rev-list: add --disk-usage option for calculating disk usage

Signed-off-by: Jeff King

It can sometimes be useful to see which refs are contributing to the overall repository size (e.g., does some branch have a bunch of objects not found elsewhere in history, which indicates that deleting it would shrink the size of a clone).

You can find that out by generating a list of objects, getting their sizes from cat-file, and then summing them, like:

git rev-list --objects --no-object-names main..branch |
git cat-file --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'

Though note that the caveats from git-cat-file(1) apply here.
We "blame" base objects more than their deltas, even though the relationship could easily be flipped.
Still, it can be a useful rough measure.

But one problem is that it's slow to run.
Teaching rev-list to sum up the sizes can be much faster for two reasons:

  1. It skips all of the piping of object names and sizes.
  2. If bitmaps are in use, for objects that are in the bitmapped packfile we can skip the oid_object_info() lookup entirely, and just ask the revindex for the on-disk size.

This patch implements a --disk-usage option which produces the same answer in a fraction of the time.
Here are some timings using a clone of torvalds/linux:

[rev-list piped to cat-file, no bitmaps]
$ time git rev-list --objects --no-object-names --all |
  git cat-file --buffer --batch-check='%(objectsize:disk)' |
  perl -lne '$total += $_; END { print $total }'
1459938510
real  0m29.635s
user  0m38.003s
sys   0m1.093s

[internal, no bitmaps]
$ time git rev-list --disk-usage --objects --all
1459938510
real  0m31.262s
user  0m30.885s
sys   0m0.376s

Even though the wall-clock time is slightly worse due to parallelism, notice the CPU savings between the two.
We saved 21% of the CPU just by avoiding the pipes.

But the real win is with bitmaps.
If we use them without the new option:

[rev-list piped to cat-file, bitmaps]
$ time git rev-list --objects --no-object-names --all --use-bitmap-index |
  git cat-file --batch-check='%(objectsize:disk)' |
  perl -lne '$total += $_; END { print $total }'
1459938510
real  0m6.244s
user  0m8.452s
sys   0m0.311s

then we're faster to generate the list of objects, but we still spend a lot of time piping and looking things up.
But if we do both together:

[internal, bitmaps]
$ time git rev-list --disk-usage --objects --all --use-bitmap-index
1459938510
real  0m0.219s
user  0m0.169s
sys   0m0.049s

then we get the same answer much faster.

For "--all", that answer will correspond closely to "du objects/pack", of course.
But we're actually checking reachability here, so we're still fast when we ask for more interesting things:

$ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
374798628
real  0m0.429s
user  0m0.356s
sys   0m0.072s

rev-list-options now includes in its man page:

--disk-usage

Suppress normal output; instead, print the sum of the bytes used for on-disk storage by the selected commits or objects. This is equivalent to piping the output into git cat-file --batch-check='%(objectsize:disk)', except that it runs much faster (especially with --use-bitmap-index). See the CAVEATS section in git cat-file for the limitations of what "on-disk storage" means.


With Git 2.38 (Q3 2022), "git rev-list --disk-usage" learned to take an optional value, human, to show the reported value in human-readable format, like "3.40MiB".

See commit 9096451 (11 Aug 2022) by Li Linchao (Cactusinhand).
(Merged by Junio C Hamano -- gitster -- in commit fddd8b4, 18 Aug 2022)

rev-list: support human-readable output for --disk-usage

Signed-off-by: Li Linchao

The '--disk-usage' option for git-rev-list was introduced in 16950f8 ("rev-list: add --disk-usage option for calculating disk usage", 2021-02-09, Git v2.31.0-rc0 -- merge).

This is very useful for people inspecting the object usage of their Git repository, but the resulting number is quite hard for a human to read.

Teach git rev-list to output a human readable result when using '--disk-usage=human'.

rev-list-options now includes in its man page:

With the optional value human, on-disk storage size is shown in human-readable string (e.g. 12.24 Kib, 3.50 Mib).

VonC answered Oct 23 '22 15:10


Here's something really ugly:

$ git rev-list HEAD |                     # list commits
  xargs -n1 git ls-tree -rl |             # expand their trees
  sed -e 's/[^ ]* [^ ]* \(.*\)\t.*/\1/' | # keep only sha-1 and size
  sort -u |                               # eliminate duplicates
  awk '{ sum += $2 } END { print sum }'   # add up the sizes in bytes

This will only count the blobs (not commits, trees, other), and will not account for either packing or cross-branch object sharing. But it could serve as the basis for something useful.

Paste-able version:

git rev-list HEAD | xargs -n1 git ls-tree -rl | sed -e 's/[^ ]* [^ ]* \(.*\)\t.*/\1/' | sort -u | awk '{ sum += $2 } END { print sum }'

JB. answered Oct 23 '22 16:10


This question doesn't really make sense: in Git, branches are not stored separately. Instead, there is a graph of commits whose objects are each stored once and shared, with packfiles delta-compressing similar objects. A branch is just a pointer to a specific commit in this graph, so in general branches share most of their data.
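To illustrate, here is a minimal sketch using a throwaway repository (the temporary path, the file name, and the branch name topic are all made up for the demo). Creating a branch adds no objects, which is probably why git count-objects -v reports the same numbers whichever branch is checked out:

```shell
#!/bin/sh
set -eu

# Throwaway repository; the file and branch names are invented for this demo.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "demo"
git -C "$repo" config user.email "demo@example.invalid"

echo "hello" > "$repo/f.txt"
git -C "$repo" add f.txt
git -C "$repo" commit -q -m "first commit"

# Number of loose objects before and after creating a branch.
before=$(git -C "$repo" count-objects -v | awk '/^count:/ { print $2 }')
git -C "$repo" branch topic
after=$(git -C "$repo" count-objects -v | awk '/^count:/ { print $2 }')

# Both names resolve to the same commit: the branch is only a new ref.
head_id=$(git -C "$repo" rev-parse HEAD)
topic_id=$(git -C "$repo" rev-parse topic)

echo "loose objects before: $before, after: $after"
echo "HEAD:  $head_id"
echo "topic: $topic_id"
```

The two object counts should match, and both names should print the same commit ID.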

If you want to know the size in disk-space of a single branch, meaning, the minimal amount of disk space someone will need if they clone the repo taking only that branch, the simplest thing is probably to make a repo just like that, and then ask for the size of that repo.
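A rough sketch of that approach, assuming a throwaway source repository built on the spot (the paths, the branch names main and feature, and the 200 KB random file are invented for illustration). Each branch is cloned on its own with --single-branch and the resulting object store is measured with du:

```shell
#!/bin/sh
set -eu

# Throwaway source repository; every path and branch name here is made up.
work=$(mktemp -d)
src="$work/src"
git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
git -C "$src" config user.name "demo"
git -C "$src" config user.email "demo@example.invalid"

echo "small" > "$src/a.txt"
git -C "$src" add a.txt
git -C "$src" commit -q -m "base"

# A side branch carrying an extra, incompressible 200 KB file.
git -C "$src" checkout -q -b feature
head -c 200000 /dev/urandom > "$src/big.bin"
git -C "$src" add big.bin
git -C "$src" commit -q -m "big file"
git -C "$src" checkout -q main

# Clone one branch at a time. The file:// URL forces the normal transport;
# a plain local path would copy the whole object store regardless of branch.
for b in main feature; do
    git clone -q --bare --single-branch --branch "$b" "file://$src" "$work/$b.git"
done

# Size of each clone's object store, in kilobytes.
main_kb=$(du -sk "$work/main.git/objects" | cut -f1)
feature_kb=$(du -sk "$work/feature.git/objects" | cut -f1)
echo "main: ${main_kb} KB, feature: ${feature_kb} KB"
```

For a real project, replace the file:// path with the remote URL and the branch names with your own. The figure describes a freshly packed clone, so it can differ from what the branch contributes inside an existing repository.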

Chris Beck answered Oct 23 '22 15:10