Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.

It seems this would be a sensible thing to do in a OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?

This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.

Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.

like image 607
D'Arcy Rittich Avatar asked Dec 14 '09 20:12

D'Arcy Rittich


People also ask

Do copied files have the same hash?

When publishing a file to PDF or copying the entire contents of a file to a new file, the contents of the file may be the same, but the HASH value, which is a digital fingerprint that reflects the contents and format of the file, will be different.

Do all operating systems use the same file system?

Different OS's use different file systems, but all have similar features. The most important part of the file system is the method used to index files on the hard drive. This index allows the OS to know at any given time where to find a specific file on the hard drive. This index is most typically based on file names.

Can two files have same hash?

Generally, two files can have the same md5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value. There is one caveat, though: An md5 sum is 128 bits (16 bytes).

What is hashing used for?

Hashing is a cryptographic process that can be used to validate the authenticity and integrity of various types of input. It is widely used in authentication systems to avoid storing plaintext passwords in databases, but is also used to validate files, documents and other types of data.


1 Answers

ZFS supports deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup

Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.

like image 122
FRotthowe Avatar answered Oct 05 '22 12:10

FRotthowe