
Hosting files on S3 and using MySQL

This question might be too general, but here goes...

I want to accept user-uploaded images and host them on S3. What's the best practice for this? I was thinking of the following:

MySQL - Create a table that holds all the image metadata:

  • Auto-increment id
  • The user ID of the uploader
  • A slug or path that points to its location in S3
  • Other image metadata (size, width, height, etc.)

S3 - Create a new bucket to hold the images

Site backend - Logic for handling upload:

  1. Accept user upload, validate file, etc
  2. Optionally process the image (resize, convert, etc)
  3. Upload to the appropriate S3 bucket w/ a new random slug
  4. If successful, add a new record to the mysql table
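The four steps above can be sketched roughly as follows. This is a minimal, hypothetical Python sketch, not a definitive implementation: the bucket name, the helper names, and the record shape are all assumptions, and the S3 client is passed in (a boto3-style client with a `put_object` method is assumed) so the flow stays testable.

```python
import hashlib
import uuid

def generate_slug() -> str:
    # Step 3's "new random slug": a collision-resistant random key.
    return uuid.uuid4().hex

def build_image_record(user_id: int, slug: str, body: bytes) -> dict:
    # The row to insert into the MySQL table (step 4). For single-part
    # PUTs, S3's ETag is the hex md5 of the body, so saving it here
    # lets us reconcile against the bucket listing later.
    return {
        "user_id": user_id,
        "slug": slug,
        "size": len(body),
        "etag": hashlib.md5(body).hexdigest(),
    }

def upload_image(s3_client, bucket: str, user_id: int, body: bytes) -> dict:
    # Upload to S3 first; only if the PUT succeeds do we return the
    # record for the database insert (steps 3 and 4 in order).
    slug = generate_slug()
    s3_client.put_object(Bucket=bucket, Key=slug, Body=body)
    return build_image_record(user_id, slug, body)
```

Doing the S3 PUT before the database insert means a failure leaves at most an orphaned object (cheap to clean up), never a database row pointing at a missing image.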

--

Is this standard practice for using S3 as a storage provider with my web service? How can I make sure the database and S3 stay in sync with one another? For example, what happens if a record is manually deleted from the database? How should I handle the orphaned S3 object? Or, on the flip side, what if an image is deleted from S3 but its corresponding record remains in the MySQL table? Is it just up to me to write a script that verifies the integrity between the two systems?

Jeff asked Nov 05 '14 17:11


1 Answer

Take a look at the information returned by the Get Bucket/List Objects call.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

In particular, the Size, LastModified, and ETag.

Save these in your table when you upload.

List Objects returns this information about the objects in your bucket, listed in key order, up to 1000 objects per request; each response tells you where it stopped, so the next request can continue where the previous one left off.
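The pagination loop can be sketched as below. This assumes a boto3-style client exposing `list_objects_v2` (the current API behind the REST GET Bucket call linked above); the function name is hypothetical.

```python
def list_all_objects(s3_client, bucket: str):
    """Yield every object in the bucket, following continuation
    tokens across pages of up to 1000 keys each."""
    kwargs = {"Bucket": bucket}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            yield obj  # dict carrying Key, Size, LastModified, ETag
        if not page.get("IsTruncated"):
            break
        # Resume the listing where the previous page stopped.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```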

At $0.005 per 1,000 requests, you can audit a bucket with 8 million objects (such as one that I have) in only 8000 requests, for $0.04.

The ETag is particularly important, since on PUT requests it's automatically set to the hex md5 hash of the object's body. (On multipart uploads, it's the hex md5 of the concatenated binary md5 digests of the parts, followed by a hyphen and the number of parts.) With this easily-fetched information, plus the size, you have something to reconcile against, and reasonable assurance that the object in the bucket is exactly what you believe it to be.
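Both ETag formats described above can be reproduced locally with nothing but `hashlib`, which is what makes them useful for reconciliation:

```python
import hashlib

def etag_single_put(body: bytes) -> str:
    """ETag of a simple (non-multipart) PUT: hex md5 of the body."""
    return hashlib.md5(body).hexdigest()

def etag_multipart(parts: list[bytes]) -> str:
    """ETag of a multipart upload: hex md5 of the concatenated
    binary md5 digests of each part, then '-' and the part count."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

If you store the computed ETag in your MySQL row at upload time, comparing it to the listing later needs no per-object requests at all.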

Then write a script that fetches the object list, and periodically compare to the database.
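The core of such a script is a set comparison between the two systems. A minimal sketch, assuming both sides have been loaded into dicts keyed by slug (the function name and value shape are hypothetical):

```python
def reconcile(db_rows: dict, s3_objects: dict):
    """Compare the MySQL table against the S3 listing.

    Both arguments map key/slug -> (size, etag). Returns three sorted
    lists: keys orphaned in S3, rows whose object is missing from S3,
    and keys present in both but with mismatched size or ETag."""
    orphaned_in_s3 = sorted(s3_objects.keys() - db_rows.keys())
    missing_in_s3 = sorted(db_rows.keys() - s3_objects.keys())
    mismatched = sorted(
        k for k in db_rows.keys() & s3_objects.keys()
        if db_rows[k] != s3_objects[k]
    )
    return orphaned_in_s3, missing_in_s3, mismatched
```

This directly answers both failure cases from the question: objects with no row, and rows with no object.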

The other significant best practice that comes to mind is not to let scattered code touch the database and S3. Keep the code that touches both things together.


One additional consistency-related thought: the x-amz-meta-* user-defined headers in S3 are very useful. You can set them when you do the upload, or modify them in place later by "copying the object onto itself" through the API with modified metadata. You can store about 8 KB of metadata with each object, such as an id that correlates it to the database row that references it. This information is not available from the Get Bucket/List Objects call; you have to send an HTTP HEAD request against the specific object in order to fetch the metadata. But if you ever wish you could figure out where a stranded bucket object came from, it is good to have that information saved.

One caveat: anybody authorized to download the object also gets a copy of the metadata in the response headers (easy to see if they are looking), so the only things you should store on public or widely-available objects are things that are trivial and not sensitive. x-amz-meta-image-id: 1337 is probably safe if knowledge of the image id is of no particular consequence. Similarly, if it's a resized image, storing the original source image's MD5 or SHA is helpful for programmatic verification that, yes, "this image" is a resized version of "that image"; even if someone gets their hands on that particular metadata, it's of no real significance, since we hold the rights to all of the images in question. Sensitive or personal data should not be stored there for public content, but on non-public objects, the metadata is as secure as the object itself.
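The set-on-copy and fetch-via-HEAD operations above map to `copy_object` (with `MetadataDirective="REPLACE"`) and `head_object` in a boto3-style client. A hedged sketch; the function names and the `image-id` metadata key are illustrative assumptions:

```python
def tag_with_row_id(s3_client, bucket: str, key: str, image_id: int) -> None:
    """Rewrite the object's user-defined metadata in place by
    copying the object onto itself, as described above."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"image-id": str(image_id)},  # becomes x-amz-meta-image-id
        MetadataDirective="REPLACE",
    )

def read_row_id(s3_client, bucket: str, key: str):
    """Fetch x-amz-meta-image-id with a HEAD request on the object."""
    head = s3_client.head_object(Bucket=bucket, Key=key)
    return head.get("Metadata", {}).get("image-id")
```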

Michael - sqlbot answered Nov 02 '22 22:11