
Is dvc.yaml supposed to be written or generated by dvc run command?

I'm trying to understand DVC. Most tutorials describe generating dvc.yaml by running the dvc run command.

At the same time, the dvc.yaml format, which defines the DAG, is also well documented. The fact that it is YAML, and therefore human-readable and human-writable, suggests that it is meant to be a DSL for specifying your data pipeline.

Can somebody clarify which is the better practice: writing dvc.yaml by hand, or letting the dvc run command generate it? Or is it left to the user's choice, with no technical difference?

rajeshnair asked Jan 25 '23 08:01


2 Answers

I'd recommend manual editing as the main route! (I believe that has been the officially recommended approach since DVC 2.0.)

dvc stage add can still be very helpful for programmatically generating pipeline files, but it doesn't support all the features of dvc.yaml, for example setting vars values or defining foreach stages.
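To illustrate the kind of features that need manual editing, here is a minimal sketch of a hand-written dvc.yaml that uses both vars and a foreach stage. The script name (clean.py), the data_dir value, and the file layout are hypothetical and chosen only for illustration.

```yaml
# dvc.yaml (hand-written sketch; clean.py and the data/ paths are hypothetical)
vars:
  - data_dir: data            # local variable, referenced below as ${data_dir}

stages:
  clean:
    foreach:                  # one stage is generated per list item
      - train
      - test
    do:
      cmd: python clean.py ${data_dir}/${item}.csv ${data_dir}/clean/${item}.csv
      deps:
        - clean.py
        - ${data_dir}/${item}.csv
      outs:
        - ${data_dir}/clean/${item}.csv
```

With a definition like this, DVC expands the template into two stages, clean@train and clean@test, which can be reproduced individually or together.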

Jorge Orpinel Pérez answered Feb 09 '23 01:02


Both, really.

Primarily, dvc run (or the newer dvc stage add followed by dvc exp run) is meant to manage your dvc.yaml file. For most users, including casual ones, this is probably the easiest and thus best option. The format is guaranteed to be correct (similar to choosing between {git,dvc} config and directly modifying .{git,dvc}/config).
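As a rough sketch of the generated route, the comment below shows a dvc stage add invocation and the stage entry it would be expected to write into dvc.yaml. The stage name, script, and paths (prepare, prepare.py, data/raw.csv, data/prepared) are hypothetical.

```yaml
# Written by a command along these lines (names and paths are hypothetical):
#   dvc stage add -n prepare -d prepare.py -d data/raw.csv -o data/prepared \
#       python prepare.py data/raw.csv
stages:
  prepare:
    cmd: python prepare.py data/raw.csv
    deps:
      - data/raw.csv
      - prepare.py
    outs:
      - data/prepared
```

The result is ordinary YAML, so a stage created this way can still be edited by hand afterwards.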

However, as you note, dvc.yaml is human-readable. This is intentional, so that more advanced users can edit the YAML manually (potentially short-circuiting some validation checks, or unlocking advanced functionality such as foreach stages).

casper.dcl answered Feb 09 '23 00:02