I want my data and models stored in separate Google Cloud buckets. The idea is that I want to be able to share the data with others without sharing the models.
One idea I can think of is using separate Git submodules for data and models, but that feels cumbersome and imposes additional requirements on the end user (e.g. having to run git submodule update).
So can I do this without using git submodules?
You can first add the different DVC remotes you want to establish (let's say you call them data and models, each one pointing to a different GC bucket), but don't set any remote as the project's default. This way, dvc push won't work without the -r (or --remote) option.
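A minimal sketch of that setup, assuming two hypothetical bucket names (my-data-bucket and my-models-bucket):

    # add one remote per bucket; without -d/--default, neither becomes the project default
    dvc remote add data gs://my-data-bucket
    dvc remote add models gs://my-models-bucket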
You would then need to push each directory or file individually to the appropriate remote, like dvc push data/ -r data and dvc push model.dat -r models.
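A full round trip could then look like this sketch (data/ and model.dat stand in for your own tracked paths):

    dvc add data/        # track the dataset with DVC
    dvc add model.dat    # track the model file
    dvc push data/ -r data         # the dataset goes only to the data bucket
    dvc push model.dat -r models   # the model goes only to the models bucket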
Note that a feature request to configure this exists in the DVC repo too; see "Specify file types that can be pushed to remote".
Yes, you can use multiple remotes without Git submodules.
There is a separate command for using data artifacts from external repositories: dvc import http://your-repo datadir
The command brings data to your repo and keeps the connection to the original repo (to avoid data duplication in different remotes).
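For example, a sketch of importing in the code/models repo (the repository URL and the data path are placeholders):

    # bring the dataset into this repo while keeping the link to its source repo
    dvc import https://github.com/your-org/dataset-repo data
    git add data.dvc .gitignore
    git commit -m "Import dataset"
    # later, pull in a newer version of the imported dataset
    dvc update data.dvc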
In your case, one repository can be used for the dataset with its own data remote. A second repo can hold the code and models; it imports the dataset project, while all of its models and outputs go to another data remote.
With dvc import, no dvc push -r myremote is needed; a plain dvc push synchronizes the data to the proper remote.
EDITED: Simply use one Git repo for the dataset with its own data remote (S3 folder, GCS bucket, etc.), and import it from another repo holding the code and model with a different data remote.
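Concretely, the two-repo setup could look like this sketch (the remote name storage, the bucket names, and the repo URL are all assumptions):

    # dataset repo: default remote is the data bucket
    dvc remote add -d storage gs://my-data-bucket
    dvc add data/
    dvc push    # dataset lands in gs://my-data-bucket

    # code/models repo: default remote is the models bucket
    dvc remote add -d storage gs://my-models-bucket
    dvc import https://github.com/your-org/dataset-repo data
    dvc add model.dat
    dvc push    # model lands in gs://my-models-bucket; the imported data stays in the dataset repo's remote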