Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.
It seems this would be a sensible thing to do in a OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
When publishing a file to PDF or copying the entire contents of a file to a new file, the contents of the file may be the same, but the HASH value, which is a digital fingerprint that reflects the contents and format of the file, will be different.
Different OS's use different file systems, but all have similar features. The most important part of the file system is the method used to index files on the hard drive. This index allows the OS to know at any given time where to find a specific file on the hard drive. This index is most typically based on file names.
Generally, two files can have the same md5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value. There is one caveat, though: An md5 sum is 128 bits (16 bytes).
Hashing is a cryptographic process that can be used to validate the authenticity and integrity of various types of input. It is widely used in authentication systems to avoid storing plaintext passwords in databases, but is also used to validate files, documents and other types of data.
ZFS supports deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With