What is the difference between these two? We used git-lfs in my previous job and we are starting to use dvc alongside git in my current one. They both place some kind of index in the repository instead of the file itself, and the file can be downloaded on demand. Does dvc offer any improvements over git-lfs?
DVC is not better than git-lfs: they are quite different. The selected answer is largely biased. Both are simply different tools, for different purposes.
[...] .gitignore) and instead, dvc generates an additional file with the same name and the extension .dvc. So, in order to push a commit with its corresponding files, the user is required to manually "add" (equivalent to git commit, not to git add; there's no equivalent for the git stage in dvc) and "push" to both systems. This is not a drawback, but a necessary level of control. In exchange, the remote large-files-holder is just any remote filesystem, accessible directly by its path, via ssh or via multiple drivers (google drive, amazon, etc.). Hooks are also available for dvc, which can simplify the use of large files if you do not mind having the additional files. Saving files to the remote still requires additional operations; remember that they are .gitignored! So, if you modify a file stored in dvc, the change will not be noticed by git status, and you might lose it unless you make the additional check with dvc.

DVC has a different purpose than git-lfs. DVC is used not only to save large files, but mainly to manage large files that are the result of deterministic processes. So, in addition to storing large files, dvc also controls processing pipelines by defining dependencies, as a Makefile does: if the processing inputs (which are also files or parameters tracked by dvc) change, dvc calculates which files must be regenerated (yes, like Makefiles). That's why DVC is usually described as a Makefile-like tool for data science. That's cool if you are generating big AI models or heavy data files, in large quantities. It is the exact equivalent of compiling large applications: every localized change implies recompiling just a small portion of the whole.
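To make the Makefile comparison concrete, here is a minimal Python sketch of the general idea: each stage gets a fingerprint derived from the fingerprints of its dependencies, and a stage must be regenerated when that fingerprint changes. This is not DVC's actual implementation or file format; the STAGES table, the stage names and the helper functions are all hypothetical and only illustrate the technique.

```python
import hashlib

def digest(*parts: str) -> str:
    """Hash a sequence of strings into one hex fingerprint."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode("utf-8"))
        h.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()

# Hypothetical pipeline: stage name -> names of the things it depends on.
# A dependency is either another stage or a raw input (a file or a parameter).
STAGES = {
    "clean": ["raw.csv"],
    "features": ["clean", "params"],
    "model": ["features"],
    "report": ["clean"],
}

def fingerprints(inputs: dict[str, str]) -> dict[str, str]:
    """Fingerprint every stage from the fingerprints of its dependencies."""
    result: dict[str, str] = {}

    def visit(name: str) -> str:
        if name in inputs:  # raw input: hash its content
            return digest("input", inputs[name])
        if name not in result:  # stage: combine its dependencies' hashes
            result[name] = digest(name, *(visit(d) for d in STAGES[name]))
        return result[name]

    for stage in STAGES:
        visit(stage)
    return result

def stale_stages(old_inputs: dict[str, str], new_inputs: dict[str, str]) -> list[str]:
    """Return the stages whose outputs must be regenerated."""
    old = fingerprints(old_inputs)
    new = fingerprints(new_inputs)
    return sorted(s for s in STAGES if old[s] != new[s])

before = {"raw.csv": "a,b\n1,2\n", "params": "lr=0.1"}
after = {**before, "params": "lr=0.01"}
print(stale_stages(before, after))  # ['features', 'model']
```

Changing only the parameter leaves the clean and report stages untouched, which is the "recompile just a small portion" behaviour described above.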
Personally, I use both for large-file storage. git-lfs simplifies large-file management, while dvc simplifies large-file storage (which eases administration), at the cost of less transparency; I have sometimes lost data with it. I still don't use dvc for pipeline calculation; until now I've preferred my own implementations. DVC is getting better, so perhaps I will use it more in the future. Both are just different; I currently use both, according to the purpose.