The application's code and configuration files are maintained in a code repository. But sometimes, as part of the project, I also have some data (in some cases more than 100 MB, or even more than 1 GB), which is stored in a database. Git does a nice job of handling the code and its changes, but how can the development team easily share the data?
It doesn't really fit in the code version control system, as it is mostly large binary files and would make pulling updates a nightmare. But it does have to be synchronised with the repository, because some code revisions change the schema (i.e. migrations).
How do you handle such situations?
We have the data and schema stored in XML and use Liquibase to handle the updates to both the schema and the data. The advantage here is that you can diff the files to see what's going on, it plays nicely with any VCS, and you can automate it.
Due to the size of your database, this would mean a sizable "version 0" file. But with the migration strategy, the updates after that should be manageable, as they would only be deltas. You might also be able to convert your existing migrations one-to-one to Liquibase, which might be nicer than a big-bang approach.
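To illustrate the "only deltas" idea, here is a minimal sketch in Python with SQLite. It is not Liquibase and does not use its changelog format; the `changelog` table, the migration IDs and the `users` schema are all made up for the example. It only shows the general mechanism: record which deltas have been applied and run the ones that are missing, in order.

```python
import sqlite3

# Hypothetical ordered deltas; in a real project each would live in its own
# versioned file next to the code.
MIGRATIONS = [
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
    ("003_seed_admin",
     "INSERT INTO users (name, email) VALUES ('admin', 'admin@example.com')"),
]


def apply_pending(conn, migrations):
    """Apply, in order, every migration not yet recorded in the changelog table."""
    conn.execute("CREATE TABLE IF NOT EXISTS changelog (id TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT id FROM changelog")}
    newly_applied = []
    for migration_id, sql in migrations:
        if migration_id in applied:
            continue  # already applied on this developer's database
        with conn:  # commit after each migration, roll back if it raises
            conn.execute(sql)
            conn.execute("INSERT INTO changelog (id) VALUES (?)", (migration_id,))
        newly_applied.append(migration_id)
    return newly_applied
```

Running `apply_pending` twice on the same database applies nothing the second time, which is what makes it safe to call after every pull.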
You can also leverage @belisarius' strategy if your deltas are very large so each developer doesn't have to apply the delta individually.
It seems to me that your database has a lot of parallels with a binary library dependency: it's large (well, much larger than a reasonable code library!), binary, and has its own versions which must correspond to various versions of your codebase.
With this in mind, why not integrate a dependency manager (e.g. Apache Ivy) with your build process and let it manage your database? This seems like just the sort of task that a dependency manager was built for.
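As a rough sketch of what "database as a versioned dependency" could look like, independent of Ivy or any particular tool: the repository holds a small manifest that pins the snapshot file name and its SHA-256, and the build checks whether the local copy matches before deciding to download. The manifest layout (`file` and `sha256` keys) and the function names are assumptions for illustration only.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a possibly large file in chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_is_current(manifest_path, snapshot_dir):
    """Return True if the local snapshot matches the version pinned in the manifest.

    The manifest is a hypothetical JSON file committed with the code, e.g.
    {"file": "appdata-v12.db", "sha256": "<hex digest>"}.
    """
    pin = json.loads(Path(manifest_path).read_text())
    snapshot = Path(snapshot_dir) / pin["file"]
    return snapshot.exists() and sha256_of(snapshot) == pin["sha256"]
```

Only the tiny manifest changes in version control; the large file is fetched from wherever the team hosts it, and only when the check fails.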
Regarding the sheer size of the data/download, I don't think there's any magic bullet (short of some serious document pre-loading infrastructure) unless you can serialize the data into a delta-able format (the XML/JSON/SQL you mentioned).
A second approach (maybe not so compatible with dependency management): If the specifics of your code allow it, you could keep a second file that is a manual diff that can take a base (version 0) database and bring it up to version X. Every developer will need to keep a clean version 0. A pull (of a version with a changed DB) will consist of: pull diff file, copy version 0 to working database, apply diff file. Note that applying the diff file might take a while for a sizable DB, so you may not be saving as much time over the straight download as it first seems.
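The pull steps above (copy version 0 to the working database, apply the diff file) can be sketched as follows. This assumes, purely for illustration, that the database is a single SQLite file and the diff is a plain SQL script; the function name and paths are hypothetical, and other database engines would need their own copy and apply steps.

```python
import shutil
import sqlite3


def rebuild_working_db(base_path, diff_sql_path, working_path):
    """Copy the clean version 0 database, then apply the diff script to the copy."""
    # The base file is only read, so it stays a clean version 0.
    shutil.copyfile(base_path, working_path)
    with open(diff_sql_path, encoding="utf-8") as handle:
        diff_sql = handle.read()
    conn = sqlite3.connect(working_path)
    try:
        conn.executescript(diff_sql)  # runs every statement in the diff file
        conn.commit()
    finally:
        conn.close()
```

Because the working copy is rebuilt from scratch each time, any local changes in it are lost, and for a large database the copy plus the diff may still take a noticeable amount of time.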
We usually rely on the database's own sync or replication scheme.
Each developer has two copies of the database: one for working and another that just holds the synced version.
When the code is synchronized, a script syncs the database too (the central DB against the developer's "dead" copy). After that, each developer updates their own working copy. Sometimes a developer needs to keep some of their own data, so this second update is not always driven by the standard script.
It is only as robust as the replication scheme, which, depending on the DB, is not always good news.
DataGrove is a new product that gives you version control for databases. We allow you to store the entire database (schema and data), tag, restore and share the database at any point in time.
This sounds like what you are looking for.
We're currently working on features to allow git-like (push-pull) behaviors so developers can share their repositories across machines, so I can load the latest version of your database when I need it.