Does anyone know how the checksum field in active_storage_blobs is calculated when using ActiveStorage on Rails 5.2+?
For bonus points, does anyone know how I can get it to use an md5 checksum that would match the one from the md5 CLI command?
I know I'm a bit late to the party, but this is more for those who come across this in a search for answers. So here it is.
Background:
Rails introduced loads of new features in version 5.2, one of which was ActiveStorage. The official final release came out on April 9th, 2018.
Disclaimer:
So to be perfectly clear, the following information pertains to out-of-the-box vanilla Active Storage. It doesn't take into account any crazy code-fu built around some one-off scenario.
With that said, the checksum is calculated differently depending on your Active Storage setup. With the vanilla out-of-the-box Rails Active Storage, there are 2 "types" (for lack of a better term) of configuration: "Proxy Upload" and "Direct Upload".
File Upload Flow: [Client] → [RoR App] → [Storage Service]
Comm. Flow: Can vary, but in most cases it should be similar to the file upload flow.
Spark.Bao's answer describes a "Proxy Upload", meaning you upload the file to your RoR application, which performs some sort of processing before sending the file on to your configured storage service (AWS, Azure, Google, BackBlaze, etc.). Even if you set your storage service to "localdisk", the logic still technically applies, even though your RoR application is the storage endpoint.
A "Proxy Upload" approach isn't ideal for RoR applications that are deployed in the cloud on services like Heroku. Heroku has a hard limit of 30 seconds to complete your transaction and send a response back to your client (end user). So if your file is fairly large, you need to consider the time it takes for your file to upload, and then account for the time needed to calculate the checksum. If you're caught in a scenario where you can't complete the request and respond within the 30 seconds, you will need to use the "Direct Upload" approach.
Proxy Uploads Answer:
The Ruby class Digest::MD5 is used in the method compute_checksum_in_chunks(io) as pointed out by Spark.Bao.
File Upload Flow: [Client] → [Storage Service]
Comm. Flow: [Client] → [RoR App] → [Client] → [Storage Service] → [Client] → [RoR App] → [Client]
Our fine friends who maintain and develop Rails have already done all the heavy lifting for us. I won't go into details on how to set up a direct upload, but here is a link that explains how » Rails EdgeGuide - Direct Uploads.
Direct Uploads Answer:
Now with all that said, with a vanilla out-of-the-box "Direct Uploads" setup, a file checksum is calculated by leveraging SparkMD5 (JavaScript).
Below is a snippet from the Rails Active Storage source code (activestorage.js):
var fileSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
var FileChecksum = function() {
  createClass(FileChecksum, null, [ {
    key: "create",
    value: function create(file, callback) {
      var instance = new FileChecksum(file);
      instance.create(callback);
    }
  } ]);
  function FileChecksum(file) {
    classCallCheck(this, FileChecksum);
    this.file = file;
    this.chunkSize = 2097152;
    this.chunkCount = Math.ceil(this.file.size / this.chunkSize);
    this.chunkIndex = 0;
  }
  createClass(FileChecksum, [ {
    key: "create",
    value: function create(callback) {
      var _this = this;
      this.callback = callback;
      this.md5Buffer = new sparkMd5.ArrayBuffer();
      this.fileReader = new FileReader();
      this.fileReader.addEventListener("load", function(event) {
        return _this.fileReaderDidLoad(event);
      });
      this.fileReader.addEventListener("error", function(event) {
        return _this.fileReaderDidError(event);
      });
      this.readNextChunk();
    }
  },
If there is anything I missed I do apologize in advance. I tried to be as thorough as possible.
So to sum things up, the following should suffice as an acceptable answer:
Proxy Upload Configuration: The Ruby class Digest::MD5.
Direct Upload Configuration: The JavaScript hash library SparkMD5.
The source code is here: https://github.com/rails/rails/blob/6aca4a9ce5f0ae8af826945b272842dbc14645b4/activestorage/app/models/active_storage/blob.rb#L369-L377
def compute_checksum_in_chunks(io)
  Digest::MD5.new.tap do |checksum|
    while chunk = io.read(5.megabytes)
      checksum << chunk
    end
    io.rewind
  end.base64digest
end
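If you want to convince yourself that hashing in chunks gives the same value as hashing the whole thing in one go, here is a minimal plain-Ruby sketch of the same idea. The helper name chunked_md5_base64 and its chunk_size parameter are my own inventions for illustration (they are not part of Rails), and I've swapped the ActiveSupport helper 5.megabytes for plain arithmetic so it runs outside a Rails app.

```ruby
require "digest"
require "stringio"

# Hypothetical helper (not part of Rails): reads the IO in fixed-size
# chunks, feeds each chunk to one MD5 digest, rewinds the IO, and returns
# the digest Base64-encoded.
def chunked_md5_base64(io, chunk_size: 5 * 1024 * 1024)
  digest = Digest::MD5.new
  # IO#read(length) returns nil at end of file, which ends the loop.
  while (chunk = io.read(chunk_size))
    digest << chunk
  end
  io.rewind
  digest.base64digest
end

data = "hello world" * 1000
io = StringIO.new(data)

# Both lines should print the same Base64 string, whatever the chunk size.
puts chunked_md5_base64(io, chunk_size: 1024)
puts Digest::MD5.base64digest(data)
```

The chunk size only affects how much memory is used at once; it should not change the resulting checksum.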
In my project, I need to use this checksum value to check whether the user has uploaded a duplicate file. I use the following code to get the same value as the method above:
md5 = Digest::MD5.file(params[:file].tempfile.path).base64digest
puts "========= md5: #{md5}"
The output:
========= md5: F/9Inmc4zdQqpeSS2ZZGug==
Database data:
pry(main)> ActiveStorage::Blob.find_by(checksum: 'F/9Inmc4zdQqpeSS2ZZGug==')
ActiveStorage::Blob Load (2.7ms) SELECT "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."checksum" = $1 LIMIT $2 [["checksum", "F/9Inmc4zdQqpeSS2ZZGug=="], ["LIMIT", 1]]
=> #<ActiveStorage::Blob:0x00007f9a16729a90
id: 1,
key: "gpN2NSgfimVP8VwzHwQXs1cB",
filename: "15 Celebrate.mp3",
content_type: "audio/mpeg",
metadata: {"identified"=>true, "analyzed"=>true},
byte_size: 9204528,
checksum: "F/9Inmc4zdQqpeSS2ZZGug==",
created_at: Thu, 29 Nov 2018 01:38:15 UTC +00:00>
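On the bonus question: the md5 CLI command prints the digest as a hexadecimal string, while the checksum column holds the same 16 bytes Base64-encoded (that is what base64digest returns). So, as far as I can tell, nothing needs to change in how Active Storage computes the value; you only need to convert between the two encodings. Below is a rough sketch of that conversion. The helper names base64_md5_to_hex and hex_md5_to_base64 are hypothetical, made up for this example.

```ruby
require "digest"

# Hypothetical helper: decode the Base64 checksum to raw bytes ("m0" is
# strict Base64), then render those bytes as lowercase hex ("H*").
def base64_md5_to_hex(base64_checksum)
  base64_checksum.unpack1("m0").unpack1("H*")
end

# Hypothetical helper: the reverse direction, from the hex string an md5
# command prints to the Base64 form stored in the checksum column.
def hex_md5_to_base64(hex_checksum)
  [[hex_checksum].pack("H*")].pack("m0")
end

# Checksum value taken from the blob record shown above.
puts base64_md5_to_hex("F/9Inmc4zdQqpeSS2ZZGug==")
```

If you still have the file on disk, Digest::MD5.file(path).hexdigest gives you the hex form directly, which you can compare with the CLI output.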