I'm currently building a video sharing site and using PHP for the uploads. I noticed that when I upload a video, S3 computes an MD5 hash of the file. I'm wondering whether S3 does any deduplication. I uploaded several copies of the same file and didn't see anything indicating that S3 recognized them as identical, or at least that it was doing anything about it.
Should I implement this on my own? I have a MySQL database where I'm storing all the video info. I could take a hash of each video and serve up a previously uploaded file if the content is the same, e.g. by simply calling md5_file() on the temporary upload file. This would seem appropriate since S3 already uses MD5. However, MD5 is slow compared to something optimized for this, such as BLAKE2. Should I do this, and what would be the best approach?
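For context, this is roughly the flow I have in mind; the $pdo connection and the videos table with its content_hash and s3_key columns are just placeholders:

```php
<?php
// Rough idea: hash the uploaded temp file and reuse the existing S3 object if
// identical content was uploaded before. Table/column names are placeholders.
$tmpFilePath = $_FILES['video']['tmp_name'];
$hash = md5_file($tmpFilePath); // or a faster/stronger algorithm

$stmt = $pdo->prepare('SELECT s3_key FROM videos WHERE content_hash = ? LIMIT 1');
$stmt->execute([$hash]);
$existingKey = $stmt->fetchColumn();

if ($existingKey !== false) {
    // Duplicate content: point the new video record at the existing S3 object.
} else {
    // New content: upload to S3, then insert a row recording $hash and the new key.
}
```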
S3 does not expose any evidence of internal deduplication. If you were to upload 500 identical files of 1 GB each, you'd be billed for storing 500 GB.
So, if you want to consider deduplicating uploaded content, you will need to roll your own solution, but consider these points:
The standard MD5 hash is not the only algorithm S3 uses for ETags. For multipart uploads, which are required for objects larger than 5 GB and optional for smaller ones, it uses a nested MD5 scheme, and two identical files uploaded with a different number of parts will not have the same ETag. (In HTTP, the scope of an ETag is a single resource, and it only carries a one-way guarantee: if a resource changes, its ETag must change, but differing ETags don't necessarily tell you anything. S3 is more stringent than that, but the ETag is still not a perfect deduplication key.)
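If you ever need to reason about those multipart ETags, they are commonly observed to be the MD5 of the concatenated binary MD5 digests of the parts, suffixed with the part count. This is observed behavior, not a documented contract, and it only works if you know the part size the uploader used (8 MB below is just an assumption). A sketch:

```php
<?php
// Sketch: reproduce the "nested MD5" style ETag S3 reports for multipart uploads.
// $partSize must match the part size the uploader actually used (assumed 8 MB here).
function multipartEtag(string $path, int $partSize = 8 * 1024 * 1024): string
{
    $binaryMd5s = '';
    $parts = 0;
    $fh = fopen($path, 'rb');
    while (!feof($fh)) {
        $chunk = fread($fh, $partSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $binaryMd5s .= md5($chunk, true); // raw 16-byte digest of each part
        $parts++;
    }
    fclose($fh);

    // Single-part uploads just get the plain MD5 as their ETag.
    return $parts <= 1 ? md5_file($path) : md5($binaryMd5s) . '-' . $parts;
}
```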
More importantly, MD5 is not adequate for deduplication. MD5 is now considered broken for most purposes, because collisions can be engineered. About the only remaining valid use for MD5 is verifying that a blob of data has not been accidentally corrupted, given a prior known MD5 hash of the blob; it is of little value for determining whether a blob has been deliberately tampered with. The odds of accidental corruption producing the same MD5 hash are astronomically low, but deliberate collisions can be engineered. SHA-1 has also been broken in practice.
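If you do hash on your side, something collision-resistant is cheap to compute: hash_file() covers SHA-256, and the bundled libsodium extension (PHP 7.2+) exposes BLAKE2b if hashing speed matters. A minimal sketch, with the file path as a placeholder:

```php
<?php
// SHA-256 via the built-in hash extension: one call, streams the file internally.
$sha256 = hash_file('sha256', $tmpFilePath);

// BLAKE2b via libsodium, streamed in 1 MB chunks to keep memory use flat.
$state = sodium_crypto_generichash_init();
$fh = fopen($tmpFilePath, 'rb');
while (!feof($fh)) {
    $chunk = fread($fh, 1024 * 1024);
    if ($chunk === false || $chunk === '') {
        break;
    }
    sodium_crypto_generichash_update($state, $chunk);
}
fclose($fh);
$blake2b = bin2hex(sodium_crypto_generichash_final($state));
```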
Since you are storing object locations in a database, you have the flexibility of not needing to address this right away. S3 storage is cheap enough (~$23/TB/month) that deduplication is unlikely to be a worthwhile pursuit for a while, and if it eventually is, you can use whatever algorithm makes sense at that point: scan for objects of identical size, compare those objects to confirm they really are identical, then update the database accordingly and clean up the duplicates, as sketched below.
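That "group by size, then confirm" cleanup pass could look roughly like this against your MySQL table; all table and column names are assumptions, including a content_hash column you would populate at whatever point you decide to start hashing:

```php
<?php
// Sketch of a later cleanup pass: find size collisions first (cheap), then
// confirm real duplicates by comparing stored content hashes. Names are placeholders.
$sizes = $pdo->query(
    'SELECT file_size FROM videos GROUP BY file_size HAVING COUNT(*) > 1'
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($sizes as $size) {
    $stmt = $pdo->prepare('SELECT id, s3_key, content_hash FROM videos WHERE file_size = ?');
    $stmt->execute([$size]);

    // Group the size-collision candidates by hash; any group larger than one is a set of dupes.
    $byHash = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $byHash[$row['content_hash']][] = $row;
    }
    foreach ($byHash as $rows) {
        if (count($rows) > 1) {
            // Keep $rows[0], repoint the other rows' s3_key at it, delete the spare objects.
        }
    }
}
```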
Another option -- one that I have used with success -- is to use bucket versioning and actually store the objects with keys based on the SHA-256 of their content. If you overwrite an object and versioning is enabled, you still have access to all the different versions of the object, but anyone downloading the object without a version-id specified will receive the most recent upload. You can purge those old objects periodically if needed, after taking steps (using a different algorithm) to ensure that you haven't found two different objects with a SHA-256 collision. (If you do find differing objects with a SHA-256 collision, you'll be famous.)
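A sketch of that content-addressed layout using the AWS SDK for PHP; the bucket name and key prefix are placeholders, and versioning is something you enable on the bucket itself:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Sketch: store objects under a key derived from their SHA-256, so identical
// uploads land on the same key. Bucket name, prefix, and region are placeholders.
$s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);

$hash = hash_file('sha256', $tmpFilePath);
$key  = 'videos/' . $hash;

$s3->putObject([
    'Bucket'     => 'example-video-bucket',
    'Key'        => $key,
    'SourceFile' => $tmpFilePath,
]);

// With bucket versioning enabled, re-uploading identical content just adds a new
// version under the same key; downloads without a version-id get the latest one.
```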