We use SVN for our source-code revision control and are experimenting with using it for non-source-code files.
We are working with a large set (300-500k) of short (1-4 kB) text files that will be updated on a regular basis and need to be version controlled. We tried SVN with all files in a single flat directory, and it struggled with the initial commit: checking in 500k files took about 36 hours.
On a daily basis, we need the system to handle 10k modified files per commit transaction in a short time (<5 min).
My questions: is SVN suitable for this kind of workload, and if not, what would be a better fit?
Thanks
Edit 1: I need version control because multiple people will be concurrently modifying the same files and doing manual diff/merge/conflict resolution in exactly the same way programmers edit source code. I therefore need a central repository to which people can check in their work and from which they can check out others' work. The workflow is virtually identical to a programming workflow, except that the users are not programmers and the file content is not source code.
Update 1: It turns out the primary issue was more a filesystem issue than an SVN issue. For SVN, committing a single directory with half a million new files did not finish even after 24 hours. Splitting the same files across 500 folders arranged in a 1x5x10x10 tree, with 1000 files per folder, brought the commit time down to 70 minutes. Commit speed drops significantly over time for a single folder with a very large number of files. Git seems a lot faster; I will update with times.
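For anyone wanting to reproduce the split, here is a rough Python sketch of how files could be hashed into a 5x10x10 tree of 500 leaf folders. The hashing scheme, paths, and function names are purely illustrative, not the exact layout we used.

```python
import hashlib
import shutil
from pathlib import Path

def shard_path(name: str) -> Path:
    """Map a file name onto a 5 x 10 x 10 directory tree (500 leaf folders),
    so no single folder ends up with hundreds of thousands of entries."""
    h = int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)
    level1 = h % 5
    level2 = (h // 5) % 10
    level3 = (h // 50) % 10
    return Path(str(level1)) / str(level2) / str(level3)

def import_files(src_dir: str, working_copy: str) -> None:
    """Copy a flat directory of small text files into the sharded layout
    inside a working copy, ready to be added and committed."""
    dest_root = Path(working_copy)
    for f in Path(src_dir).iterdir():
        if f.is_file():
            target_dir = dest_root / shard_path(f.name)
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target_dir / f.name)

if __name__ == "__main__":
    import_files("flat_files", "working_copy")
```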
As of July 2008, the Linux kernel git repo (2.6.26) had about 260,000 files.
http://linuxator.wordpress.com/2008/07/22/5-things-you-didnt-know-about-linux-kernel-code-metrics/
At that number of files, the kernel developers still say git is really fast. I don't see why it'd be any slower at 500,000 files. Git tracks content, not files.
"Is SVN suitable?" As long as you're not checking out or updating the entire repository, then yes, it is.
SVN is quite bad at committing very large numbers of files (especially on Windows), because all those .svn directories are written to in order to update a lock whenever you operate on the working copy. If you have a small number of directories you won't notice, but the time taken seems to increase exponentially.
However, once everything is committed (in chunks, directory by directory perhaps), things become very much quicker. Updates don't take as long, and you can use the sparse checkout feature (highly recommended) to work on just a section of the repository. Assuming you don't need to modify thousands of files at once, you'll find it works quite well.
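To make the sparse checkout suggestion concrete, here is a hedged Python sketch that drives the svn client with `--depth empty` and `--set-depth infinity`; the repository URL and folder names are placeholders, not anything from the original setup.

```python
import subprocess

# Placeholder URL: substitute your own repository.
REPO = "http://svn.example.com/repos/textfiles/trunk"

def sparse_checkout(wc_path, wanted_folders):
    """Check out an empty working copy, then deepen only the top-level
    folders a given user actually works on."""
    subprocess.run(["svn", "checkout", "--depth", "empty", REPO, wc_path],
                   check=True)
    for folder in wanted_folders:
        subprocess.run(["svn", "update", "--set-depth", "infinity",
                        f"{wc_path}/{folder}"], check=True)

if __name__ == "__main__":
    # Pull down just two of the 500-odd folders.
    sparse_checkout("wc", ["0", "1"])
```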
Committing 10,000 files all at once is, again, not going to be speedy, but 1,000 files ten times a day will be much more manageable.
So try it once you've got all the files in there, and see how it works. All of this will be fixed in 1.7, where the working-copy mechanism is being reworked to remove those .svn directories (so keeping locks is simpler and much quicker).
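As a sketch of the "in chunks, directory by directory" idea, something like the following could drive the daily commits, assuming the 10k changes have already been made in a working copy laid out as in the update above; the folder layout and commit messages are illustrative.

```python
import subprocess
from pathlib import Path

def commit_by_folder(wc_path, message="Daily update"):
    """Commit one top-level folder at a time instead of a single huge
    10,000-file transaction. svn simply does nothing for folders with no
    pending changes, so this is safe to run over the whole tree."""
    for folder in sorted(Path(wc_path).iterdir()):
        if folder.is_dir() and folder.name != ".svn":
            subprocess.run(
                ["svn", "commit", "-m", f"{message}: {folder.name}", str(folder)],
                check=True,
            )

if __name__ == "__main__":
    commit_by_folder("wc")
```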
For files that short, I'd look into using a database instead of a filesystem.
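To make that suggestion concrete, here is a minimal sketch of what storing the documents in SQLite might look like, with a simple revision counter added since the files need versioning; the schema, table, and names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect("documents.db")
# One row per revision of each short document; names are illustrative.
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        path     TEXT    NOT NULL,
        revision INTEGER NOT NULL,
        content  TEXT    NOT NULL,
        PRIMARY KEY (path, revision)
    )
""")

def save(path, content):
    """Store a new revision of a document."""
    cur = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) FROM documents WHERE path = ?", (path,))
    next_rev = cur.fetchone()[0] + 1
    conn.execute("INSERT INTO documents VALUES (?, ?, ?)", (path, next_rev, content))
    conn.commit()

def load(path):
    """Fetch the latest revision of a document."""
    cur = conn.execute(
        "SELECT content FROM documents WHERE path = ? "
        "ORDER BY revision DESC LIMIT 1", (path,))
    row = cur.fetchone()
    return row[0] if row else ""

if __name__ == "__main__":
    save("notes/0001.txt", "first draft")
    save("notes/0001.txt", "second draft")
    print(load("notes/0001.txt"))  # -> "second draft"
```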