How is the checksum calculated in the blobs table for Rails ActiveStorage?

Does anyone know how the checksum field in active_storage_blobs is calculated when using ActiveStorage on Rails 5.2+?

For bonus points, does anyone know how I can get it to use an md5 checksum that would match the one from the md5 CLI command?

mvh asked Jun 02 '18 13:06


2 Answers

Let's Break It Down

I know I'm a bit late to the party, but this is more for those who come across this question while searching for answers. So here it is.

Background:

Rails introduced loads of new features in version 5.2, one of which was ActiveStorage. The official final release came out on April 9th, 2018.

  • Rails 5.2 Official Release Notes

Disclaimer:

To be perfectly clear, the following information applies to vanilla, out-of-the-box Active Storage. It doesn't take into account custom code written for one-off scenarios.

With that said, the checksum is calculated differently depending on your Active Storage setup. With vanilla, out-of-the-box Rails Active Storage, there are two "types" (for lack of a better term) of configuration:

  1. Proxy Uploads
  2. Direct Uploads

Proxy Uploads

File Upload Flow: [Client] → [RoR App] → [Storage Service]

Comm. Flow: Can vary, but in most cases it should be similar to the file upload flow.

Spark.Bao's answer describes a "Proxy Upload": you upload the file to your RoR application, which can perform some sort of processing before sending the file on to your configured storage service (AWS, Azure, Google, Backblaze, etc.). Even if you set your storage service to the local disk, the logic still technically applies, even though your RoR application is the storage endpoint.

A "Proxy Upload" approach isn't ideal for RoR applications that are deployed in the cloud on services like Heroku. Heroku has a hard limit of 30 seconds to complete the request and send a response back to your client (end user). So if your file is fairly large, you need to account for both the time it takes to upload the file and the time it takes to calculate the checksum. If you're caught in a scenario where you can't send a response within those 30 seconds, you will need to use the "Direct Upload" approach.

Proxy Uploads Answer:

The Ruby class Digest::MD5 is used in the method compute_checksum_in_chunks(io) as pointed out by Spark.Bao.


Direct Uploads

File Upload Flow: [Client] → [Storage Service]

Comm. Flow: [Client] → [RoR App] → [Client] → [Storage Service] → [Client] → [RoR App] → [Client]

Our fine friends who maintain and develop Rails have already done all the heavy lifting for us. I won't go into detail on how to set up a direct upload, but here is a link on how: Rails Edge Guide - Direct Uploads.

Direct Uploads Answer:

Now with all that said, with a vanilla out-of-the-box "Direct Uploads" setup, a file checksum is calculated by leveraging SparkMD5 (JavaScript).

Below is a snippet from the Rails Active Storage source code (activestorage.js):

  var fileSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
  var FileChecksum = function() {
    createClass(FileChecksum, null, [ {
      key: "create",
      value: function create(file, callback) {
        var instance = new FileChecksum(file);
        instance.create(callback);
      }
    } ]);
    function FileChecksum(file) {
      classCallCheck(this, FileChecksum);
      this.file = file;
      this.chunkSize = 2097152;
      this.chunkCount = Math.ceil(this.file.size / this.chunkSize);
      this.chunkIndex = 0;
    }
    createClass(FileChecksum, [ {
      key: "create",
      value: function create(callback) {
        var _this = this;
        this.callback = callback;
        this.md5Buffer = new sparkMd5.ArrayBuffer();
        this.fileReader = new FileReader();
        this.fileReader.addEventListener("load", function(event) {
          return _this.fileReaderDidLoad(event);
        });
        this.fileReader.addEventListener("error", function(event) {
          return _this.fileReaderDidError(event);
        });
        this.readNextChunk();
      }
    },
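
The snippet above reads the file in 2097152-byte (2 MiB) chunks, while the Ruby method in the other answer reads 5 MB at a time. That difference should not matter: MD5 fed incrementally gives the same digest whatever the chunk size. Here is a minimal Ruby sketch of that idea, assuming the client ends up sending the Base64-encoded digest. The helper name md5_base64_in_chunks is made up for illustration and is not part of Rails.

```ruby
require "digest"
require "stringio"

# Hypothetical helper (not part of Rails): feed an IO to MD5 chunk by chunk
# and return the Base64-encoded digest.
def md5_base64_in_chunks(io, chunk_size)
  digest = Digest::MD5.new
  # IO#read(n) returns nil at end of file, which ends the loop.
  while (chunk = io.read(chunk_size))
    digest.update(chunk)
  end
  digest.base64digest
end

# Roughly 6 MiB of reproducible pseudo-random bytes, so both chunk sizes
# need more than one read.
data = Random.new(42).bytes(6 * 1024 * 1024 + 123)

js_sized   = md5_base64_in_chunks(StringIO.new(data), 2_097_152)       # chunk size from activestorage.js
ruby_sized = md5_base64_in_chunks(StringIO.new(data), 5 * 1024 * 1024) # 5 MiB, as in 5.megabytes

puts js_sized == ruby_sized
```

So a blob created through a direct upload and one created through a proxy upload of the same file should end up with the same checksum value.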

Conclusion

If there is anything I missed, I apologize in advance. I tried to be as thorough as possible.

To sum things up, the following should suffice as an acceptable answer:

  • Proxy Upload Configuration: The ruby class Digest::MD5

  • Direct Upload Configuration: The JavaScript hash library SparkMD5.

user953533 answered Oct 15 '22 12:10


The source code is here: https://github.com/rails/rails/blob/6aca4a9ce5f0ae8af826945b272842dbc14645b4/activestorage/app/models/active_storage/blob.rb#L369-L377

def compute_checksum_in_chunks(io)
  Digest::MD5.new.tap do |checksum|
    while chunk = io.read(5.megabytes)
      checksum << chunk
    end

    io.rewind
  end.base64digest
end
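
On the bonus question: the method above ends with base64digest, so the stored value is the Base64 encoding of the 16 raw MD5 bytes, whereas md5 (or md5sum) prints those same bytes as hex. Rather than changing what Active Storage stores, one option is to convert between the two encodings. A minimal sketch in plain Ruby, where the helper names base64_md5_to_hex and hex_md5_to_base64 are made up for illustration:

```ruby
require "digest"
require "base64"

# Hypothetical helper: Base64 checksum (as stored in active_storage_blobs)
# to the hex string that md5/md5sum print.
def base64_md5_to_hex(checksum)
  Base64.strict_decode64(checksum).unpack1("H*")
end

# Hypothetical helper: hex digest from the CLI back to the Base64 form.
def hex_md5_to_base64(hex)
  Base64.strict_encode64([hex].pack("H*"))
end

stored = Digest::MD5.base64digest("hello world")
puts base64_md5_to_hex(stored) == Digest::MD5.hexdigest("hello world")
```

With a real record, something like base64_md5_to_hex(blob.checksum) should give a string you can compare against the CLI output for the same file.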

In my project, I need to use this checksum value to tell whether a user has uploaded a duplicate file. I use the following code to get the same value as the method above:

md5 = Digest::MD5.file(params[:file].tempfile.path).base64digest
puts "========= md5: #{md5}"

the output:

========= md5: F/9Inmc4zdQqpeSS2ZZGug==

database data:

pry(main)> ActiveStorage::Blob.find_by(checksum: 'F/9Inmc4zdQqpeSS2ZZGug==')
  ActiveStorage::Blob Load (2.7ms)  SELECT  "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."checksum" = $1 LIMIT $2  [["checksum", "F/9Inmc4zdQqpeSS2ZZGug=="], ["LIMIT", 1]]
=> #<ActiveStorage::Blob:0x00007f9a16729a90
id: 1,
key: "gpN2NSgfimVP8VwzHwQXs1cB",
filename: "15 Celebrate.mp3",
content_type: "audio/mpeg",
metadata: {"identified"=>true, "analyzed"=>true},
byte_size: 9204528,
checksum: "F/9Inmc4zdQqpeSS2ZZGug==",
created_at: Thu, 29 Nov 2018 01:38:15 UTC +00:00>
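
The duplicate check itself can be sketched without Rails. This is only an illustration under assumptions: DuplicateUploadDetector is a made-up class, and the in-memory Set stands in for the active_storage_blobs table. In a real app the lookup would be a query such as the find_by above.

```ruby
require "digest"
require "set"
require "tempfile"

# Hypothetical class: remembers the Base64 MD5 of every file it has seen.
class DuplicateUploadDetector
  def initialize
    @seen = Set.new # stand-in for the checksum column in the database
  end

  # Returns true when a file with identical content was seen before.
  def duplicate?(path)
    checksum = Digest::MD5.file(path).base64digest
    # Set#add? returns nil when the element is already present.
    @seen.add?(checksum).nil?
  end
end

# Small helper for the demo: write content to a temp file and keep it open.
def temp_with(content)
  Tempfile.new("upload").tap do |f|
    f.write(content)
    f.flush
  end
end

detector = DuplicateUploadDetector.new
first  = temp_with("same bytes")
second = temp_with("same bytes")
other  = temp_with("other bytes")

puts detector.duplicate?(first.path)
puts detector.duplicate?(second.path)
puts detector.duplicate?(other.path)
```

Since MD5 is not collision-resistant, it may be worth also comparing byte_size before treating two uploads as the same file.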
Spark.Bao answered Oct 15 '22 12:10