
Difference between git-lfs and dvc


What is the difference between these two? We used git-lfs in my previous job and we are starting to use dvc alongside git in my current one. Both put some kind of pointer in place of the file, and the actual content can be downloaded on demand. Does dvc offer any improvements over git-lfs?

Jakub Vonšovský asked Oct 24 '19 12:10



1 Answer

DVC is not better than git-lfs; they are quite different. The selected answer is largely biased. They are simply different tools for different purposes.

  • git-lfs is intended to be transparent to git, and therefore it requires a server that supports it. The learning curve is short: a few configuration commands and it is running, storing large files separately from the git repository. That is its only function, and it does it well. Needing that extra server is not a drawback but a requirement for such transparency. Once configured, the files are handled by git itself, by means of git filters and hooks (scripts that git runs during its own operations).
  • dvc is intended to give the end user independent management of large files. What dvc basically does is this: it makes git ignore the files you want it to control (by adding them to .gitignore) and generates an additional file with the same name plus the extension .dvc. So, to push a commit together with its files, the user has to manually "add" (equivalent to git commit, not to git add; dvc has no equivalent of the git staging area) and "push" in both systems. This is not a drawback but a necessary level of control. In exchange, the remote that holds the large files can be just about any remote filesystem, reachable directly by path, via ssh, or through one of several drivers (Google Drive, Amazon S3, etc.). Hooks are also available for dvc, and they simplify working with large files if you do not mind the additional files. Still, saving files to the remote takes extra operations, and remember that the files are .gitignored! If you modify a file stored in dvc, git status will not notice the change, and you might lose it unless you also check with dvc.
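
To make the "pointer instead of file" idea concrete, here is a minimal Python sketch of what both tools do in spirit: hash the large file, park its content in a content-addressed store, and leave a tiny pointer for git to version. Everything in it is an assumption for illustration only: `track`, `is_modified`, the `.ptr` extension and the JSON layout are hypothetical, and neither git-lfs nor dvc uses this exact format.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into a content-addressed cache and write a pointer file.

    Hypothetical helper: the JSON pointer is NOT the real .dvc or
    git-lfs pointer format, it only illustrates the idea.
    """
    oid = file_hash(path)
    cached = cache_dir / oid[:2] / oid[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(path, cached)

    pointer = path.with_name(path.name + ".ptr")
    pointer.write_text(json.dumps({"hash": oid, "path": path.name}))

    # Hide the large file from git; only the small pointer gets committed.
    with (path.parent / ".gitignore").open("a") as fh:
        fh.write(f"/{path.name}\n")
    return pointer


def is_modified(path: Path) -> bool:
    """Report whether the data file no longer matches its pointer."""
    pointer = path.with_name(path.name + ".ptr")
    recorded = json.loads(pointer.read_text())["hash"]
    return file_hash(path) != recorded
```

The `is_modified` check is the part git cannot do for you once the file is ignored: the change is only visible by comparing the file against its pointer, which is why the extra check with the tool itself matters.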

DVC has a different purpose than git-lfs. DVC is used not only to store large files, but mainly to manage large files that are the result of deterministic processes. So, in addition to storing large files, dvc also controls processing pipelines the way Makefiles do: you declare the dependencies of each processing step, and if the inputs (which are also files or parameters tracked by dvc) change, dvc works out which files must be regenerated (yes, like Makefiles). That's why DVC is often described as a Makefile tool for data science. That's cool if you are generating big AI models or heavy data files in large quantities. It is the exact equivalent of compiling a large application: each localized change means recompiling only a small portion of the whole.
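
The Makefile-like behaviour can be sketched in a few lines of Python. This is only an illustration of the dependency idea, not how dvc is implemented: the `stale_stages` function, the stage names and the file names are all made up, and the sketch assumes stages are declared upstream first.

```python
from typing import Dict, List, Set

# Each hypothetical stage lists the files it reads (deps) and writes (outs).
Stage = Dict[str, List[str]]


def stale_stages(stages: Dict[str, Stage], changed: Set[str]) -> List[str]:
    """Return the stages to re-run, in declaration order.

    A stage is stale if one of its deps changed or is produced by a
    stale stage. Assumes stages are declared upstream first.
    """
    dirty = set(changed)
    to_run = []
    for name, stage in stages.items():
        if dirty.intersection(stage["deps"]):
            to_run.append(name)
            dirty.update(stage["outs"])  # downstream stages see new outputs
    return to_run


pipeline = {
    "prepare": {"deps": ["raw.csv", "prepare.py"], "outs": ["clean.csv"]},
    "train": {"deps": ["clean.csv", "train.py"], "outs": ["model.bin"]},
    "report": {"deps": ["model.bin", "report.py"], "outs": ["report.html"]},
}

# Editing only the training script leaves "prepare" untouched.
print(stale_stages(pipeline, {"train.py"}))  # ['train', 'report']
```

Changing `raw.csv` instead would mark all three stages as stale, which is the "recompile only what is affected" behaviour described above.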

Personally, I use both for large-file storage. git-lfs simplifies large-file management, while dvc simplifies large-file storage (which eases administration) at the cost of that transparency; I have occasionally lost data because of it. I still don't use dvc for pipelines; so far I have preferred my own implementations. DVC is getting better, and perhaps I will use it more in the future. The two are just different, and I currently use each according to the purpose.

RodolfoAP answered Oct 04 '22 13:10
