 

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/check-out), but we don't really have any evidence to back this up.

So my question is: what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?

A.R. asked Dec 12 '11


2 Answers

If those files were constantly changing and their deltas were big, I would eventually expect a penalty in the overall TFS performance.

You clearly state that this is not the case, so, provided that your SQL server has the capacity to house the storage, I believe you should be able to proceed without any ill effects.

A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will now take that much longer. The severity of this greatly depends on your network configuration and stability.
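As a rough feel for how that adds up, here is a small sketch with purely hypothetical figures (file counts, sizes, and build frequency are assumptions, not numbers from the question) of what a build farm transfers if every clean build re-downloads the test data:

    # Rough feel for how re-downloading static test data adds up on a build
    # farm when every clean build pulls a fresh workspace. All figures are
    # hypothetical, for illustration only.

    TEST_DATA_MB = 2 * 200 + 1024   # e.g. two 200 MB files plus one 1 GB file
    CLEAN_BUILDS_PER_DAY = 20       # assumed number of from-scratch builds

    daily_transfer_gb = TEST_DATA_MB * CLEAN_BUILDS_PER_DAY / 1024
    print(f"test data per clean build: {TEST_DATA_MB} MB")
    print(f"daily transfer just for test data: ~{daily_transfer_gb:.1f} GB")
    # Incremental builds that reuse an existing workspace avoid this cost,
    # since the files themselves rarely change.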

pantelif answered Sep 18 '22


tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.

Network bandwidth: There is very little overhead in checking in or getting files; it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit from having a TFS source control proxy on their local network to speed up downloads.
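To put the bandwidth point in perspective, here is a quick back-of-the-envelope sketch (link speeds and file sizes are assumptions for illustration) of how long a get of files like those in the question might take over different links; a local proxy effectively moves the download onto the LAN:

    # Back-of-the-envelope download times for a full get of the large test
    # files over different links. All sizes and speeds are assumptions.

    FILE_SIZES_MB = [200, 200, 1024]     # e.g. two 200 MB files and one 1 GB file

    LINKS_MBPS = {
        "remote client, 20 Mbit/s WAN": 20,
        "remote client, 100 Mbit/s WAN": 100,
        "local TFS proxy, 1 Gbit/s LAN": 1000,
    }

    total_mb = sum(FILE_SIZES_MB)

    for name, mbps in LINKS_MBPS.items():
        seconds = (total_mb * 8) / mbps  # MB -> megabits, divided by link speed
        print(f"{name:32s} ~{seconds / 60:4.1f} min for {total_mb} MB")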

Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
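A small sketch of what that means for a client keeping up with a frequently changing large file, compared with a hypothetical delta-transfer scheme (all sizes are invented for illustration):

    # Bytes a client downloads while keeping up with a changing large file:
    # full-file transfer on every get (the TFS behaviour described above)
    # versus a hypothetical delta transfer. Sizes are illustrative only.

    FILE_SIZE_MB = 200        # size of one large XML test file
    CHANGED_MB_PER_REV = 2    # assume each revision only touches ~2 MB of it
    REVISIONS = 10

    full_transfer = FILE_SIZE_MB * REVISIONS
    delta_transfer = FILE_SIZE_MB + CHANGED_MB_PER_REV * (REVISIONS - 1)

    print(f"full-file transfer over {REVISIONS} revisions: {full_transfer} MB")
    print(f"delta transfer over {REVISIONS} revisions:     {delta_transfer} MB")
    # For mostly-static files (a tweak every year or so), the gap hardly
    # matters; it only becomes painful if the files change frequently.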

Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.

If these files get modified frequently, you will need to account for the disk space used by the revisions as well. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions, as in the typical use case with text files, the storage cost should be low. However, if the entirety of the contents changes, as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
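To make the two extremes concrete, here is a quick sketch of server-side storage for one file over a handful of revisions (file and delta sizes are assumed, not measured):

    # Rough server-side storage for one large file under version control.
    # "text-like" assumes small deltas between revisions; "binary-like"
    # assumes each revision is effectively stored in full. Assumed sizes.

    FILE_SIZE_MB = 1024       # one 1 GB test data file
    REVISIONS = 5             # check-ins over the file's lifetime
    DELTA_MB = 10             # assumed typical delta for text-like content

    text_like = FILE_SIZE_MB + DELTA_MB * (REVISIONS - 1)   # base + small deltas
    binary_like = FILE_SIZE_MB * REVISIONS                   # full copy per revision

    print(f"text-like (small deltas):  ~{text_like} MB")
    print(f"binary-like (full copies): ~{binary_like} MB")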

One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately; a background "deltafication" job runs nightly to compute the deltas and trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
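The worst-case transient cost is easy to estimate: revisions checked in per day times the full file size, since each revision sits in the database whole until the nightly job runs. A sketch with assumed figures:

    # Worst-case extra space consumed by one large file between check-in
    # and the nightly deltafication job, while each revision is still
    # stored in full. Figures are assumptions for illustration.

    FILE_SIZE_MB = 500
    CHECKINS_PER_DAY = 8      # hypothetical: several automated updates a day

    peak_before_deltafication_mb = FILE_SIZE_MB * CHECKINS_PER_DAY
    print(f"space used before the nightly job: ~{peak_before_deltafication_mb} MB")
    # For files that change roughly once a year, this transient cost is a
    # single extra copy of the file and is rarely worth worrying about.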

Client storage: Clients will also need enough disk space to hold these files (although only at the revision they've downloaded). This can be mitigated with workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.

Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If multiple clients do this concurrently, it could tax the server's memory.
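Roughly speaking, the work grows with how far back you reach: one delta applied per intervening changeset, multiplied by however many clients ask at once. A toy sketch of that, with all sizes and counts assumed for illustration:

    # Toy model of serving a historical version: reconstruct the requested
    # revision by applying one stored delta per changeset walked through,
    # for each concurrent request. All sizes and counts are assumptions.

    FILE_SIZE_MB = 1024       # full size of the file
    DELTA_MB = 15             # assumed average delta between revisions
    CHANGESETS_BACK = 7       # "the version seven changesets ago"
    CONCURRENT_CLIENTS = 4    # e.g. several build agents asking at once

    deltas_read_mb = DELTA_MB * CHANGESETS_BACK
    working_set_mb = (FILE_SIZE_MB + deltas_read_mb) * CONCURRENT_CLIENTS

    print(f"deltas applied per request: {CHANGESETS_BACK} (~{deltas_read_mb} MB)")
    print(f"rough concurrent working set on the server: ~{working_set_mb} MB")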

Edward Thomson answered Sep 19 '22