 

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/check-out), but we don't really have any evidence to back this up.

So my question is: what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?

A.R. asked Dec 12 '11


2 Answers

If those files were constantly changing and their deltas were big, I would eventually expect a penalty in the overall TFS performance.

You clearly state that this is not the case, so, provided that your SQL server has the capacity to house the storage, I believe you should be able to proceed without any ill effects.

A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will now take that much longer. The severity of this greatly depends on your network configuration and stability.
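As a rough feel for how that adds up, here is a small sketch with purely hypothetical figures (file counts, sizes, and build frequency are assumptions, not numbers from the question) of what a build farm transfers if every clean build re-downloads the test data:

    # Rough feel for how re-downloading static test data adds up on a build
    # farm when every clean build pulls a fresh workspace. All figures are
    # hypothetical, for illustration only.

    TEST_DATA_MB = 2 * 200 + 1024   # e.g. two 200 MB files plus one 1 GB file
    CLEAN_BUILDS_PER_DAY = 20       # assumed number of from-scratch builds

    daily_transfer_gb = TEST_DATA_MB * CLEAN_BUILDS_PER_DAY / 1024
    print(f"test data per clean build: {TEST_DATA_MB} MB")
    print(f"daily transfer just for test data: ~{daily_transfer_gb:.1f} GB")
    # Incremental builds that reuse an existing workspace avoid this cost,
    # since the files themselves rarely change.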

pantelif answered Sep 18 '22


tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.

Network bandwidth: There is very little overhead in checking in or getting files; it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit from having a TFS source control proxy on their local network to speed up downloads.
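To put the bandwidth point in perspective, here is a quick back-of-the-envelope sketch (link speeds and file sizes are assumptions for illustration) of how long a get of files like those in the question might take over different links; a local proxy effectively moves the download onto the LAN:

    # Back-of-the-envelope download times for a full get of the large test
    # files over different links. All sizes and speeds are assumptions.

    FILE_SIZES_MB = [200, 200, 1024]     # e.g. two 200 MB files and one 1 GB file

    LINKS_MBPS = {
        "remote client, 20 Mbit/s WAN": 20,
        "remote client, 100 Mbit/s WAN": 100,
        "local TFS proxy, 1 Gbit/s LAN": 1000,
    }

    total_mb = sum(FILE_SIZES_MB)

    for name, mbps in LINKS_MBPS.items():
        seconds = (total_mb * 8) / mbps  # MB -> megabits, divided by link speed
        print(f"{name:32s} ~{seconds / 60:4.1f} min for {total_mb} MB")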

Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
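A small sketch of what that means for a client keeping up with a frequently changing large file, compared with a hypothetical delta-transfer scheme (all sizes are invented for illustration):

    # Bytes a client downloads while keeping up with a changing large file:
    # full-file transfer on every get (the TFS behaviour described above)
    # versus a hypothetical delta transfer. Sizes are illustrative only.

    FILE_SIZE_MB = 200        # size of one large XML test file
    CHANGED_MB_PER_REV = 2    # assume each revision only touches ~2 MB of it
    REVISIONS = 10

    full_transfer = FILE_SIZE_MB * REVISIONS
    delta_transfer = FILE_SIZE_MB + CHANGED_MB_PER_REV * (REVISIONS - 1)

    print(f"full-file transfer over {REVISIONS} revisions: {full_transfer} MB")
    print(f"delta transfer over {REVISIONS} revisions:     {delta_transfer} MB")
    # For mostly-static files (a tweak every year or so), the gap hardly
    # matters; it only becomes painful if the files change frequently.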

Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.

If these files get modified frequently, you will need to account for the disk space used by the revisions as well. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions, as in the typical use case with text files, the storage cost should be low. However, if the entirety of the contents changes, as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
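To make the two extremes concrete, here is a quick sketch of server-side storage for one file over a handful of revisions (file and delta sizes are assumed, not measured):

    # Rough server-side storage for one large file under version control.
    # "text-like" assumes small deltas between revisions; "binary-like"
    # assumes each revision is effectively stored in full. Assumed sizes.

    FILE_SIZE_MB = 1024       # one 1 GB test data file
    REVISIONS = 5             # check-ins over the file's lifetime
    DELTA_MB = 10             # assumed typical delta for text-like content

    text_like = FILE_SIZE_MB + DELTA_MB * (REVISIONS - 1)   # base + small deltas
    binary_like = FILE_SIZE_MB * REVISIONS                   # full copy per revision

    print(f"text-like (small deltas):  ~{text_like} MB")
    print(f"binary-like (full copies): ~{binary_like} MB")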

One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately; a background "deltafication" job runs nightly to compute the deltas and trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
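The worst-case transient cost is easy to estimate: revisions checked in per day times the full file size, since each revision sits in the database whole until the nightly job runs. A sketch with assumed figures:

    # Worst-case extra space consumed by one large file between check-in
    # and the nightly deltafication job, while each revision is still
    # stored in full. Figures are assumptions for illustration.

    FILE_SIZE_MB = 500
    CHECKINS_PER_DAY = 8      # hypothetical: several automated updates a day

    peak_before_deltafication_mb = FILE_SIZE_MB * CHECKINS_PER_DAY
    print(f"space used before the nightly job: ~{peak_before_deltafication_mb} MB")
    # For files that change roughly once a year, this transient cost is a
    # single extra copy of the file and is rarely worth worrying about.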

Client storage: Clients will also need enough disk space to hold these files (although only at the revision they've downloaded). This can be mitigated with workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.

Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If multiple clients do this concurrently, it could tax the server's memory.
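Roughly speaking, the work grows with how far back you reach: one delta applied per intervening changeset, multiplied by however many clients ask at once. A toy sketch of that, with all sizes and counts assumed for illustration:

    # Toy model of serving a historical version: reconstruct the requested
    # revision by applying one stored delta per changeset walked through,
    # for each concurrent request. All sizes and counts are assumptions.

    FILE_SIZE_MB = 1024       # full size of the file
    DELTA_MB = 15             # assumed average delta between revisions
    CHANGESETS_BACK = 7       # "the version seven changesets ago"
    CONCURRENT_CLIENTS = 4    # e.g. several build agents asking at once

    deltas_read_mb = DELTA_MB * CHANGESETS_BACK
    working_set_mb = (FILE_SIZE_MB + deltas_read_mb) * CONCURRENT_CLIENTS

    print(f"deltas applied per request: {CHANGESETS_BACK} (~{deltas_read_mb} MB)")
    print(f"rough concurrent working set on the server: ~{working_set_mb} MB")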

Edward Thomson answered Sep 19 '22