
Django StaticFiles and Amazon S3: How to detect modified files?

I'm using Django staticfiles + django-storages with Amazon S3 to host my static files. Everything is working well, except that every time I run manage.py collectstatic the command uploads every file to the server.

It looks like the management command compares timestamps from Storage.modified_time(), which isn't implemented in the S3 storage backend from django-storages.
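
For reference, the skip logic in collectstatic looks roughly like this (a loose paraphrase of Django's source, not an exact copy; details vary by version):

    # paraphrased from django/contrib/staticfiles/management/commands/collectstatic.py
    try:
        # ask the remote storage when its copy was last modified
        target_last_modified = self.storage.modified_time(prefixed_path)
    except (OSError, NotImplementedError, AttributeError):
        pass  # backend can't report modification times, so the file gets re-uploaded
    else:
        source_last_modified = source_storage.modified_time(path)
        if target_last_modified >= source_last_modified:
            # the remote copy is up to date; skip this file
            ...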

How do you guys determine if an S3 file has been modified?

I could store file paths and last-modified dates in my database. Or is there an easy way to pull the last-modified date from Amazon?
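
For the second option, boto does expose the Last-Modified header on each key; a minimal sketch (the bucket and key names are placeholders):

    # minimal sketch with boto; 'my-bucket' and the key path are placeholders
    from boto.s3.connection import S3Connection

    conn = S3Connection()                        # credentials come from env/boto config
    bucket = conn.get_bucket('my-bucket')
    key = bucket.get_key('static/css/site.css')  # one request per file
    if key is not None:
        print key.last_modified                  # e.g. 'Thu, 07 Jul 2011 22:00:00 GMT'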

Another option: it looks like I can assign arbitrary metadata with boto, so I could store the local modified date as metadata when I first upload each file.
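
Something like this, I imagine (a sketch only; the metadata key name is made up, and 'bucket' is the object from the snippet above):

    # sketch: attach a custom x-amz-meta-* header at upload time
    import os.path
    from boto.s3.key import Key

    local_path = 'static/css/site.css'
    key = Key(bucket, local_path)
    key.set_metadata('local-mtime', str(os.path.getmtime(local_path)))  # set before upload
    key.set_contents_from_filename(local_path)   # metadata is sent with the upload

    # later, read it back:
    key = bucket.get_key(local_path)
    stored_mtime = key.get_metadata('local-mtime')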

Anyway, this seems like a common problem, so I'd like to ask what solutions others have used. Thanks!

asked Jul 07 '11 by Yuji 'Tomita' Tomita

2 Answers

The latest version of django-storages (1.1.3) handles file-modification detection through the S3 Boto backend.

pip install django-storages and you're good to go :) Gotta love open source!

Update: set the AWS_PRELOAD_METADATA option to True in your settings file for very fast syncs if you're using the S3Boto storage class. If you're using the older S3 class, use the PreloadedS3 class instead.
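
In settings, that looks something like this (a sketch for django-storages 1.1.x with its standard S3Boto backend path):

    # settings.py -- sketch for django-storages 1.1.x with the S3Boto backend
    STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
    AWS_PRELOAD_METADATA = True  # fetch all key metadata in one listing up front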


Update 2: even with AWS_PRELOAD_METADATA enabled, the command is still extremely slow to run.


Update 3: I forked the django-storages repository to fix the issue and opened a pull request.

The problem is in the modified_time method: the fallback value is evaluated even when it isn't needed, because dict.get() always evaluates its default argument, and here the default is a bucket.get_key() call that hits the network on every lookup. I moved the fallback into an if block so it executes only when get() returns None.

    entry = self.entries.get(name, self.bucket.get_key(self._encode_name(name)))

Should be:

    entry = self.entries.get(name)
    if entry is None:
        entry = self.bucket.get_key(self._encode_name(name))

With this change, 1,000 requests go from about 100s down to under 0.5s.
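
In context, the patched method looks roughly like this (a sketch based on the django-storages 1.1.x S3BotoStorage source; parse_ts is boto's timestamp parser):

    # sketch of the patched S3BotoStorage.modified_time (django-storages 1.1.x)
    from boto.utils import parse_ts

    def modified_time(self, name):
        name = self._normalize_name(self._clean_name(name))
        # with AWS_PRELOAD_METADATA, self.entries holds every preloaded key
        entry = self.entries.get(name)
        if entry is None:
            # fall back to a per-file request only when the key wasn't preloaded
            entry = self.bucket.get_key(self._encode_name(name))
        return parse_ts(entry.last_modified)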


Update 4:

For syncing 10k+ files, I believe boto has to make multiple requests, since S3 paginates list results (at most 1,000 keys per request); that causes a 5-10 second sync time, and it will only get worse as the number of files grows.
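
boto hides the pagination behind an iterator, but each page is still a separate round trip:

    # bucket.list() transparently issues one request per page of at most
    # 1,000 keys, so 10k files means 10+ sequential round trips
    all_keys = dict((k.name, k) for k in bucket.list())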

I'm thinking the solution is a custom management command, or a django-storages update, that stores a single file on S3 holding the metadata of all the other files; that manifest would be updated whenever a file is uploaded via the collectstatic command, as sketched below.

It won't detect files uploaded by other means, but that doesn't matter if the sole entry point is the management command.
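
A hypothetical sketch of that manifest idea (nothing here exists in django-storages; the key name and helpers are made up):

    # hypothetical: one JSON manifest on S3 maps each path to its mtime
    import json
    from boto.s3.key import Key

    MANIFEST_KEY = '.collectstatic-manifest.json'  # illustrative name

    def load_manifest(bucket):
        key = bucket.get_key(MANIFEST_KEY)
        return json.loads(key.get_contents_as_string()) if key else {}

    def save_manifest(bucket, manifest):
        Key(bucket, MANIFEST_KEY).set_contents_from_string(json.dumps(manifest))

    # during collectstatic: compare local mtimes against the manifest,
    # upload only what changed, then write the manifest back once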

answered by Yuji 'Tomita' Tomita


I answered the same question here: https://stackoverflow.com/a/17528513/1220706. Check out https://github.com/FundedByMe/collectfast. It's a pluggable Django app that caches the ETag of each remote S3 file and compares the cached checksum instead of performing a lookup for every file. Follow the installation instructions and run collectstatic as normal. It took me from an average of around 1m30s down to about 10s per deploy.
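
As I recall, the setup was roughly this (check the repository's README for the current instructions):

    # settings.py -- Collectfast setup, roughly as documented at the time
    AWS_PRELOAD_METADATA = True       # required by Collectfast

    INSTALLED_APPS = (
        'collectfast',                # must precede 'django.contrib.staticfiles'
        'django.contrib.staticfiles',
        # ...
    )

installed with pip install Collectfast.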

answered by antonagestam