 

Scalable (half-million files) version control system

We use SVN for our source-code revision control and are experimenting with using it for non-source-code files.

We are working with a large set (300-500k) of short (1-4 kB) text files that will be updated regularly and that we need to keep under version control. We tried using SVN with the files in one flat directory, and it struggled with the initial commit of 500k files, which took about 36 hours.

On a daily basis, we need the system to be able to handle 10k modified files per commit transaction in a short time (<5 min).

My questions:

  1. Is SVN the right solution for my purpose? The initial speed seems too slow for practical use.
  2. If yes, is there a particular SVN server implementation that is fast? (We are currently using the default SVN server and command-line client on GNU/Linux.)
  3. If no, what are the best F/OSS or commercial alternatives?

Thanks


Edit 1: I need version control because multiple people will be concurrently modifying the same files and will be doing manual diff/merge/conflict resolution in exactly the same way that programmers edit source code. Thus I need a central repository to which people can check in their work and from which they can check out others' work. The workflow is virtually identical to a programming workflow, except that the users are not programmers and the file content is not source code.


Update 1: It turns out that the primary issue was more of a filesystem issue than an SVN issue. For SVN, committing a single directory with half a million new files did not finish even after 24 hours. Splitting the same files across 500 folders arranged in a 1x5x10x10 tree, with 1,000 files per folder, brought the commit time down to 70 minutes. Commit speed drops significantly over time for a single folder holding a large number of files; sharding the files into a directory tree avoids this (see the sketch below). Git seems a lot faster. Will update with times.
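For anyone hitting the same filesystem bottleneck, here is a minimal sketch (in Python, which the original workflow does not necessarily use) of sharding a flat directory into a 1x5x10x10 tree of 500 leaf folders. The hash-based bucketing and the function names are illustrative assumptions, not what was actually done here:

```python
import hashlib
import shutil
from pathlib import Path

def shard_path(root: Path, filename: str) -> Path:
    """Map a filename to a folder in a 1x5x10x10 tree (500 leaf folders).

    Bucket indices come from a stable hash of the name, so the same file
    always lands in the same folder.
    """
    digest = int(hashlib.md5(filename.encode("utf-8")).hexdigest(), 16)
    a = digest % 5            # first level: 5 folders
    b = (digest // 5) % 10    # second level: 10 folders each
    c = (digest // 50) % 10   # third level: 10 folders each -> 5*10*10 = 500 leaves
    return root / str(a) / str(b) / str(c)

def shard_files(flat_dir: Path, sharded_root: Path) -> None:
    """Copy every file from one flat directory into the sharded layout."""
    for src in flat_dir.iterdir():
        if src.is_file():
            dest_dir = shard_path(sharded_root, src.name)
            dest_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest_dir / src.name)
```

With ~500k files this averages about 1,000 files per leaf folder, which matches the layout that brought the commit time down to 70 minutes.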

asked Mar 31 '10 by hashable


3 Answers

As of July 2008, the Linux kernel git repository (version 2.6.26) had about 260,000 files.

http://linuxator.wordpress.com/2008/07/22/5-things-you-didnt-know-about-linux-kernel-code-metrics/

At that number of files, the kernel developers still say git is really fast. I don't see why it'd be any slower at 500,000 files. Git tracks content, not files.
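To check how git behaves at the question's scale before committing to it, one can time a batched import with the ordinary git CLI. A rough measurement harness, sketched in Python; the function name, batch size, and directory layout are my own assumptions:

```python
import subprocess
import time
from pathlib import Path

def timed_git_import(repo: Path, batch_size: int = 10_000) -> None:
    """Import all files under `repo` into a fresh git repository in batches,
    printing how long each commit takes (a measurement harness only)."""
    # Collect the file list before `git init` so .git itself is not included.
    files = sorted(str(p.relative_to(repo)) for p in repo.rglob("*") if p.is_file())
    subprocess.run(["git", "init"], cwd=repo, check=True)
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        start = time.monotonic()
        # ~10k short relative paths per invocation stays well under Linux's
        # ARG_MAX; shrink batch_size if your paths are long.
        subprocess.run(["git", "add", "--"] + batch, cwd=repo, check=True)
        subprocess.run(["git", "commit", "-q", "-m", f"import batch {i // batch_size}"],
                       cwd=repo, check=True)
        print(f"batch {i // batch_size}: {len(batch)} files in "
              f"{time.monotonic() - start:.1f}s")
```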

answered by jonescb


Is SVN suitable? As long as you're not checking out or updating the entire repository, then yes, it is.

SVN is quite bad at committing very large numbers of files (especially on Windows), as all those .svn directories are written to in order to update a lock whenever you operate on the working copy. If you have a small number of directories you won't notice, but the time taken seems to increase exponentially.

However, once everything is committed (in chunks, perhaps directory by directory), things become very much quicker. Updates don't take so long, and you can use the sparse-checkout feature (highly recommended; see the sketch below) to work on sections of the repository. Assuming you don't need to modify thousands of files at once, you'll find it works quite well.
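A sparse checkout along these lines can start from an empty working-copy root and pull in only the subtrees a given user needs. A minimal sketch, assuming Python driving the stock svn client (sparse directories require SVN 1.5+); the function name and directory list are placeholders:

```python
import subprocess

def sparse_checkout(repo_url: str, working_copy: str, wanted_dirs: list[str]) -> None:
    """Check out only selected subtrees of a large SVN repository."""
    # Check out the repository root with no contents at all.
    subprocess.run(["svn", "checkout", "--depth", "empty", repo_url, working_copy],
                   check=True)
    # Fill in only the subtrees this user actually works on.
    for d in wanted_dirs:
        subprocess.run(["svn", "update", "--set-depth", "infinity", d],
                       cwd=working_copy, check=True)
```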

Committing 10,000 files all at once is, again, not going to be speedy, but 1,000 files ten times a day will be much more manageable (see the batching sketch below).

So try it once you've got all the files in there and see how it works. All of this is addressed in SVN 1.7, where the working-copy mechanism is reworked to remove those per-directory .svn directories (so keeping locks is simpler and much quicker).
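For the chunked commits suggested above (e.g. 1,000 files at a time), svn's --targets option can take the file list for each batch. A rough sketch in Python around the svn CLI; the helper name and batch size are my own assumptions:

```python
import os
import subprocess
import tempfile

def commit_in_batches(working_copy: str, changed_files: list[str],
                      batch_size: int = 1_000) -> None:
    """Commit a long list of modified files in fixed-size batches rather than
    one huge transaction, passing each batch via svn's --targets option."""
    for i in range(0, len(changed_files), batch_size):
        batch = changed_files[i:i + batch_size]
        # Write this batch's paths (relative to the working copy) to a temp file.
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
            f.write("\n".join(batch))
            targets_file = f.name
        try:
            subprocess.run(
                ["svn", "commit", "--targets", targets_file,
                 "-m", f"batch commit {i // batch_size + 1}"],
                cwd=working_copy, check=True)
        finally:
            os.unlink(targets_file)
```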

answered by gbjbaanb


For files this short, I'd look into using a database instead of a filesystem.
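To make the database idea concrete, here is a minimal sketch with SQLite; the schema and helper names are purely illustrative assumptions, not a design from the answer, and it deliberately ignores the diff/merge workflow the question also needs:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    name     TEXT NOT NULL,
    version  INTEGER NOT NULL,
    content  TEXT NOT NULL,
    PRIMARY KEY (name, version)
);
"""

def save_version(conn: sqlite3.Connection, name: str, content: str) -> int:
    """Append a new version of a short document and return its version number."""
    cur = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM documents WHERE name = ?", (name,))
    version = cur.fetchone()[0]
    conn.execute("INSERT INTO documents (name, version, content) VALUES (?, ?, ?)",
                 (name, version, content))
    conn.commit()
    return version

def latest(conn: sqlite3.Connection, name: str) -> str | None:
    """Return the newest stored content for a document, if any."""
    row = conn.execute(
        "SELECT content FROM documents WHERE name = ? ORDER BY version DESC LIMIT 1",
        (name,)).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    conn = sqlite3.connect("docs.db")
    conn.executescript(SCHEMA)
    v = save_version(conn, "notes/alpha.txt", "first draft")
    print(v, latest(conn, "notes/alpha.txt"))
```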

answered by Javier