I'm trying to understand DVC. Most tutorials mention generating dvc.yaml by running the dvc run command.
At the same time, the dvc.yaml format, which defines the DAG, is also well documented. The fact that it is YAML, and therefore human-readable and human-writable, suggests that it is meant to be a DSL for specifying your data pipeline.
Can somebody clarify which is the better practice: writing dvc.yaml by hand, or letting it be generated by the dvc run command?
Or is it left to the user's choice, with no technical difference?
I'd recommend manual editing as the main route! (I believe that has been the officially recommended approach since DVC 2.0.)
dvc stage add can still be very helpful for programmatic generation of pipeline files, but it doesn't support all the features of dvc.yaml, for example setting vars values or defining foreach stages.
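To make those two features concrete, here is a minimal sketch of a hand-written dvc.yaml that uses both vars and foreach. The stage name, the train.py script, the paths, and the epochs variable are all made up for illustration; only the vars, foreach/do, and ${...} syntax come from DVC itself.

```yaml
# Hypothetical hand-written dvc.yaml (all names and paths are illustrative)
vars:
  - epochs: 10              # a variable defined directly in dvc.yaml

stages:
  train:
    foreach:                # one stage is generated per list item
      - small
      - large
    do:
      # ${item} is the current foreach element, ${epochs} comes from vars
      cmd: python train.py --model ${item} --epochs ${epochs}
      deps:
        - train.py
        - data/prepared.csv
      outs:
        - models/${item}.pkl
```

With a file like this, DVC should expand the template into two stages, train@small and train@large, which is something you can't express through dvc stage add alone.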
Both, really.
Primarily, dvc run (or the newer dvc stage add followed by dvc exp run) is meant to manage your dvc.yaml file. For most users (including casual ones), this is probably the easiest and thus the best option. The format is guaranteed to be correct (similar to choosing between {git,dvc} config and directly modifying .{git,dvc}/config).
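As a rough sketch of what the generated route looks like, the comment below shows a dvc stage add call and the YAML underneath is approximately what it should write into dvc.yaml. The stage name, script, and data paths are hypothetical, and the exact ordering or indentation DVC emits may differ.

```yaml
# Approximately what this (illustrative) command would generate:
#   dvc stage add -n prepare \
#       -d prepare.py -d data/raw.csv \
#       -o data/prepared.csv \
#       python prepare.py data/raw.csv data/prepared.csv
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/prepared.csv
    deps:
      - data/raw.csv
      - prepare.py
    outs:
      - data/prepared.csv
```

Note that dvc stage add only records the stage; it is dvc exp run (or dvc repro) that actually executes it.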
However, as you note, dvc.yaml is human-readable. This is intentional, so that more advanced users can edit the YAML manually (potentially short-circuiting some validation checks, or unlocking advanced functionality such as foreach stages).