I just removed a DVC tracking file by mistake with the command dvc remove training_data.dvc -p, and my entire training dataset is gone. I know that in Git we can easily restore a deleted branch from its hash. Does anyone know how to recover my lost data in DVC?
Most likely you are safe (at least, the data is not gone). From the dvc remove docs:
Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.
So, if you created training_data.dvc with dvc add and/or dvc run, and dvc remove -p didn't ask or warn you about anything, it means the data is cached, similar to Git, in .dvc/cache.
There are ways to retrieve it, but I would need a few more details: how exactly did you add your dataset? Did you commit training_data.dvc, or is it completely gone? Was it the only data you have added so far? (Happy to help in the comments.)
First of all, here is the document that describes briefly how DVC stores directories in the cache.
What we can do is to find all .dir files in the .dvc/cache:
find .dvc/cache -type f -name "*.dir"
outputs something like:
.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir
(If the local cache is lost and we are restoring data from remote storage, the same logic applies, but the commands, e.g. to find files with the .dir extension on S3, look different.)
Each .dir file is a JSON file describing one version of a directory (file names, hashes, etc.). It has all the information needed to restore that directory. The next step is to work out which one we need. There is no single rule for that; what to check, and which one to pick, depends on your use case.
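One way to compare the candidates is to print, for each .dir file, when it was last modified and which file names it lists. Below is a minimal Python sketch of that idea, not an official DVC tool. The function names summarize_dir_files and find_candidates are made up for this example, and it assumes every .dir file is a JSON list of objects with a "relpath" key, as in the sample shown in this answer. The search is recursive, so it should also work if your DVC version nests the cache more deeply.

```python
import json
from datetime import datetime
from pathlib import Path


def summarize_dir_files(cache_root=".dvc/cache"):
    """Return one summary dict per .dir file found under cache_root."""
    summaries = []
    for dir_file in sorted(Path(cache_root).rglob("*.dir")):
        if not dir_file.is_file():
            continue
        # Assumption: each .dir file is a JSON list of {"md5", "relpath"} objects.
        entries = json.loads(dir_file.read_text())
        summaries.append({
            "path": str(dir_file),
            "modified": datetime.fromtimestamp(dir_file.stat().st_mtime),
            "files": sorted(entry["relpath"] for entry in entries),
        })
    return summaries


def find_candidates(cache_root, wanted_relpath):
    """Return the paths of .dir files that list wanted_relpath."""
    return [
        summary["path"]
        for summary in summarize_dir_files(cache_root)
        if wanted_relpath in summary["files"]
    ]
```

For example, find_candidates(".dvc/cache", "train.tsv") returns only the .dir files that mention train.tsv, and the "modified" field helps if you remember roughly when the data was added.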
Okay, now let's imagine we decided to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir (e.g. because its content looks like this:
[
{"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"},
{"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]
and we want to get a directory with train.tsv).
The only thing we need to do is to create a .dvc file that references this directory:
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: my-directory
(Note that the path /20/b786b6e6f80e2b3fcf17827ad18597.dir became the hash value 20b786b6e6f80e2b3fcf17827ad18597.dir.)
And run dvc pull on this file.
That should be it.
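If you would rather not type the hash by hand, the path-to-hash step and the small .dvc file can be generated with a few lines of Python. This is only a sketch under the assumptions of this answer: the helper names cache_path_to_hash and write_dvc_file are hypothetical, the .dvc file name is whatever you choose, and the file contains just the two fields shown above. Your DVC version may expect or add more fields.

```python
from pathlib import Path


def cache_path_to_hash(dir_file):
    """Join the two-character folder name and the file name into one hash."""
    dir_file = Path(dir_file)
    return dir_file.parent.name + dir_file.name


def write_dvc_file(dir_file, out_path, dvc_file):
    """Write a minimal .dvc file that references the cached directory."""
    content = (
        "outs:\n"
        f"- md5: {cache_path_to_hash(dir_file)}\n"
        f"  path: {out_path}\n"
    )
    Path(dvc_file).write_text(content)
    return content
```

For example, write_dvc_file(".dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir", "my-directory", "my-directory.dvc") writes the file shown earlier, after which you can run dvc pull my-directory.dvc.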