I want to classify ~1M+ documents and have a version control system for the inputs and outputs of the corresponding model.
The data changes over time: basically "everything" might change, i.e. the number of observations, the features, and the values. We are interested in making the ML model building reproducible without using 10/100+ GB of disk volume, because we save every updated version of the input data. Currently the data volume is ~700 MB.
The most promising tool I found is https://github.com/iterative/dvc. Currently the data is stored in a database and loaded from there in R/Python.
Question:
How much disk volume can (very approximately) be saved by using DVC, if one can roughly estimate that? I tried to find out whether only the "diffs" of the data are saved, but I didn't find much information by reading through https://github.com/iterative/dvc#how-dvc-works or other documentation.
I am aware that this is a very vague question and that it will highly depend on the dataset. However, I would still be interested in getting a very approximate idea.
Let me try to summarize how DVC stores data, and I hope you'll be able to figure out from this how much space will be saved or consumed in your specific scenario.
DVC stores and deduplicates data at the individual file level. What does this usually mean from a practical perspective?
I will use dvc add as an example, but the same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.
Let's imagine I have a single 1GB XML file. I start tracking it with DVC:
$ dvc add data.xml
On a modern file system (or if hardlinks or symlinks are enabled, see this for more details), after this command we still consume 1GB, even though the file is moved into the DVC cache and is still present in the workspace (as a link).
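As a side note, the link strategy can be configured explicitly. A minimal sketch (cache.type is a real DVC config option; the exact set of values and the --relink flag may depend on your DVC version, so check dvc checkout --help):
$ dvc config cache.type "reflink,hardlink,symlink,copy"   # try these strategies in order
$ dvc checkout --relink                                   # re-link workspace files using the configured strategy
$ du -sh .                                                # workspace + cache together still take ~1GB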
Now, let's change it a bit and save it again:
$ echo "<test/>" >> data.xml
$ dvc add data.xml
In this case we will have 2GB consumed. DVC does not compute a diff between the two versions of the same file, nor does it split files into chunks or blocks to detect that only a small portion of the data has changed.
To be precise, it calculates the md5 of each file and saves the file in a content-addressable key-value storage: the md5 of the file serves as the key (the path of the file inside the cache), and the value is the file itself:

(.env) [ivan@ivan ~/Projects/test]$ md5 data.xml
0c12dce03223117e423606e92650192c

(.env) [ivan@ivan ~/Projects/test]$ tree .dvc/cache
.dvc/cache
└── 0c
    └── 12dce03223117e423606e92650192c

1 directory, 1 file

(.env) [ivan@ivan ~/Projects/test]$ ls -lh data.xml
data.xml ----> .dvc/cache/0c/12dce03223117e423606e92650192c (some type of link)
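To make the 2GB figure concrete, here is a hypothetical continuation of that session after the second dvc add (the second hash is made up for illustration): both versions sit in the cache in full.
$ tree .dvc/cache
.dvc/cache
├── 0c
│   └── 12dce03223117e423606e92650192c    # first version of data.xml (~1GB)
└── 8f
    └── (hypothetical hash of the second version, another ~1GB)
$ du -sh .dvc/cache
2.0G    .dvc/cache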
Let's now imagine we have a single large 1GB directory images with a lot of files:
$ du -hs images
1GB
$ ls -l images | wc -l
1001
$ dvc add images
At this point we still consume 1GB. Nothing has changed. But if we modify the directory by adding more files (or removing some of them):
$ cp /tmp/new-image.png images
$ ls -l images | wc -l
1002
$ dvc add images
In this case, after saving the new version, we are still close to 1GB of consumption (plus the size of the new file). DVC calculates the diff at the directory level: it will not save again the files that already existed in the directory.
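As far as I understand DVC's cache layout, you can check this yourself. In the sketch below the sizes are illustrative, and the *.dir entries are the small JSON manifests DVC keeps for each tracked version of a directory:
$ du -sh .dvc/cache
1.1G    .dvc/cache                        # each image stored once, plus the one new file
$ find .dvc/cache -name '*.dir' | wc -l
2                                         # one tiny listing per directory version, not a full copy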
The same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.
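For example, outputs produced by a pipeline stage are cached the same way. A minimal sketch, assuming a hypothetical train.py and DVC 1.x-style named stages (on newer versions dvc run has been superseded by dvc stage add):
$ dvc run -n train -d images -d train.py -o model.pkl python train.py
# model.pkl is moved into .dvc/cache and linked back into the workspace,
# exactly like a file tracked with dvc add; each new version of model.pkl
# is stored as a separate full object in the cache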
Please let me know if this is clear or if we need to add more details or clarifications.