 

Ways to achieve de-duplicated file storage within Amazon S3?

I am wondering about the best way to achieve de-duplicated (single-instance storage) file storage within Amazon S3. For example, if I have 3 identical files, I would like to store the file only once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.

I'm wondering what approaches people have used to accomplish this.

asked Sep 14 '11 by ebeland


1 Answer

You could probably roll your own solution to do this. Something along the lines of:

To upload a file (see the sketch after this list):

  1. Hash the file first, using SHA-1 or stronger.
  2. Use the hash to name the file. Do not use the actual file name.
  3. Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
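A minimal sketch of those three steps in Python with boto3, using SHA-256 (which satisfies "SHA-1 or stronger"). The bucket names `my-dedup-blobs` (content-addressed data) and `my-dedup-fs` (directory entries), and the helper names, are hypothetical choices, not anything S3 provides:

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BLOB_BUCKET = "my-dedup-blobs"  # content-addressed data blobs (hypothetical name)
FS_BUCKET = "my-dedup-fs"       # 'virtual file system' pointer objects (hypothetical name)

def sha256_of_file(path):
    """Hash the file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def upload_file(local_path, virtual_path):
    digest = sha256_of_file(local_path)
    # Store the blob under its hash, not its original name.
    with open(local_path, "rb") as f:
        s3.put_object(Bucket=BLOB_BUCKET, Key=digest, Body=f)
    # The directory entry is just a tiny object whose body is the hash.
    s3.put_object(Bucket=FS_BUCKET, Key=virtual_path, Body=digest.encode("utf-8"))
```

Note that the blob and the directory entry are written in two separate requests, so a failure in between can leave an orphaned blob; that only wastes a little storage rather than breaking correctness.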

To upload subsequent files (sketch below):

  1. Calculate the hash, and only upload the data blob file if it doesn't already exist.
  2. Save the directory entry with the hash as the content, like for all files.
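For the existence check, one option with boto3 is a `head_object` call on the hash key. This sketch assumes the same hypothetical buckets as above and that `digest` was computed as in the previous sketch:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BLOB_BUCKET = "my-dedup-blobs"  # hypothetical bucket names, as above
FS_BUCKET = "my-dedup-fs"

def blob_exists(digest):
    """Return True if a blob with this hash is already stored."""
    try:
        s3.head_object(Bucket=BLOB_BUCKET, Key=digest)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

def upload_dedup(local_path, virtual_path, digest):
    # `digest` is the file's hash, computed as in the previous sketch.
    if not blob_exists(digest):
        # Only transfer the data if this content has never been seen before.
        with open(local_path, "rb") as f:
            s3.put_object(Bucket=BLOB_BUCKET, Key=digest, Body=f)
    # The directory entry is written either way.
    s3.put_object(Bucket=FS_BUCKET, Key=virtual_path, Body=digest.encode("utf-8"))
```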

To read a file (sketch below):

  1. Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
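Reading then becomes a two-step lookup; a sketch under the same hypothetical bucket layout:

```python
import boto3

s3 = boto3.client("s3")
BLOB_BUCKET = "my-dedup-blobs"  # hypothetical bucket names, as above
FS_BUCKET = "my-dedup-fs"

def read_file(virtual_path):
    # The directory entry's body is the hash of the real content.
    entry = s3.get_object(Bucket=FS_BUCKET, Key=virtual_path)
    digest = entry["Body"].read().decode("utf-8")
    # Fetch the actual data blob by its hash.
    blob = s3.get_object(Bucket=BLOB_BUCKET, Key=digest)
    return blob["Body"].read()
```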

You could also make this technique more efficient by uploading files in fixed-size blocks and de-duplicating, as above, at the block level rather than at the whole-file level. Each file in the virtual file system would then contain one or more hashes, representing the ordered chain of blocks that make up that file. This has the added advantage that uploading a large file which differs only slightly from a previously uploaded one requires far less storage and data transfer.
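A rough sketch of that block-level variant, again with boto3 and the same hypothetical buckets; the 4 MiB block size is an arbitrary choice, and here the directory entry stores the ordered block hashes as JSON:

```python
import hashlib
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BLOB_BUCKET = "my-dedup-blobs"  # hypothetical bucket names, as above
FS_BUCKET = "my-dedup-fs"
BLOCK_SIZE = 4 * 1024 * 1024    # 4 MiB blocks; the size is an arbitrary choice

def upload_file_blocks(local_path, virtual_path):
    block_hashes = []
    with open(local_path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            block_hashes.append(digest)
            # Upload the block only if no block with this hash exists yet.
            try:
                s3.head_object(Bucket=BLOB_BUCKET, Key=digest)
            except ClientError as e:
                if e.response["Error"]["Code"] != "404":
                    raise
                s3.put_object(Bucket=BLOB_BUCKET, Key=digest, Body=block)
    # The directory entry now holds the ordered list of block hashes.
    s3.put_object(Bucket=FS_BUCKET, Key=virtual_path,
                  Body=json.dumps(block_hashes).encode("utf-8"))
```

Reading a block-level file is just the reverse: fetch the JSON list of hashes from the directory entry and concatenate the blocks in order.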

answered Jan 04 '23 by Nikhil Dabas