We are doing background data processing with Django Celery: taking a CSV file (up to 15MB), converting it into a list of dicts (which also include some Django model objects), and breaking it up into chunks to process in sub-tasks:
@task
def main_task(data):
    i = 0
    for chunk in chunk_up(data):
        chunk_id = "chunk_id_{}".format(i)
        cache.set(chunk_id, chunk, timeout=FIVE_HOURS)
        sub_task.delay(chunk_id)
        i += 1

@task
def sub_task(chunk_id):
    data_chunk = cache.get(chunk_id)
    ...  # do processing
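(chunk_up is our own helper; a minimal sketch of what it does, assuming fixed-size slices, looks roughly like this:)

CHUNK_SIZE = 1000  # illustrative; ours is tuned so each chunk stays a manageable size

def chunk_up(data, chunk_size=CHUNK_SIZE):
    """Yield successive fixed-size slices of the list of dicts."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]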
All tasks run in concurrent processes in the background, managed by Celery. We originally used a Redis backend but found it would routinely run out of memory during peak-load, high-concurrency scenarios, so we switched to Django's file-based cache backend. Although that fixed the memory issue, we saw that 20-30% of the cache entries never got written. No error was thrown, just silent failure. When we went back and inspected the cache from the CLI, we saw that, for example, chunk_id_7 and chunk_id_9 would exist but chunk_id_8 would not. So intermittently, some cache entries are failing to get saved.
We swapped in the diskcache backend and are observing the same thing, though cache failures seem to be reduced to 5-10% (a very rough estimate).
We noticed that in the past there were concurrent-process issues with Django's file-based cache, but they seem to have been fixed many years ago (we are on v1.11). One comment says this cache backend is more of a proof of concept, though again we're not sure whether that has changed since.
Is the file-based cache a production-quality caching solution? If so, what could be causing our write failures? If not, what's a better solution for our use case?
In both Django's FileBasedCache and DiskCache's DjangoCache, the issue was that the caches got full and were culled in the background by the respective backends. In the case of the Django file-based cache, culling happens when MAX_ENTRIES is reached (default 300), at which point it randomly removes a fraction of entries (1 / CULL_FREQUENCY, so a third by default). So our cache was filling up and random entries were being deleted, which of course causes cache.get() in sub_task to fail on any chunk whose entry was randomly deleted.
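If you want to stay on the file-based backend, both knobs can be raised via OPTIONS in the cache settings; something like this (the values here are illustrative, not a recommendation):

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/var/tmp/django_cache',
        'TIMEOUT': 5 * 60 * 60,  # matches the five-hour chunk timeout
        'OPTIONS': {
            'MAX_ENTRIES': 10000,   # cull only once this many entries exist
            'CULL_FREQUENCY': 3,    # delete 1/3 of entries when culling runs
        },
    },
}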
For DiskCache, the default cache size_limit is 1GB. When it is reached, entries are culled according to the EVICTION_POLICY, which defaults to least recently used. In our case, once the size_limit was reached, it was deleting entries that were still in use, albeit the least recently used ones.
After understanding this, we tried using DiskCache with EVICTION_POLICY = 'none' to avoid culling entirely. This almost worked, but for a small fraction (< 1%) of cache entries, cache.get() still failed to return an entry that actually existed in the cache. Maybe an SQLite error? Even after adding retry=True to every cache.get() call, it would still fail to fetch entries that actually exist in the cache some fraction of the time.
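For reference, this is roughly the shape of the DiskCache configuration we tried, assuming OPTIONS are passed straight through to diskcache's settings (values illustrative):

CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/var/tmp/diskcache',
        'TIMEOUT': 5 * 60 * 60,
        'OPTIONS': {
            'size_limit': 2 ** 32,        # 4GB instead of the 1GB default
            'eviction_policy': 'none',    # expire entries, never evict them
        },
    },
}

# In the task, diskcache's get() accepts retry=True to retry on transient SQLite timeouts:
# data_chunk = cache.get(chunk_id, retry=True)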
So we ended up implementing a more deterministic FileBasedCache that seems to do the trick:
import io

from django.core.cache.backends.filebased import FileBasedCache as DjangoFileBasedCached


class FileBasedCache(DjangoFileBasedCached):
    def _cull(self):
        '''
        In order to make the cache deterministic, rather than culling
        randomly, simply remove all expired entries.

        Use MAX_ENTRIES to avoid checking every file in the cache on
        every set() operation. MAX_ENTRIES should be set large enough
        that when it's hit we can be pretty sure there will be expired
        files. If set too low, we will be checking for expired files
        too frequently, which defeats the purpose of MAX_ENTRIES.
        '''
        filelist = self._list_cache_files()
        num_entries = len(filelist)
        if num_entries < self._max_entries:
            return  # return early if no culling is required
        if self._cull_frequency == 0:
            return self.clear()  # clear the cache when CULL_FREQUENCY = 0
        for fname in filelist:
            with io.open(fname, 'rb') as f:
                # _is_expired() automatically deletes the file if it has expired
                self._is_expired(f)
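To wire it in, CACHES just points at the subclass (the module path here is hypothetical; use wherever you put the class):

CACHES = {
    'default': {
        'BACKEND': 'myproject.caching.FileBasedCache',  # hypothetical path to the subclass above
        'LOCATION': '/var/tmp/django_cache',
        'TIMEOUT': 5 * 60 * 60,
        'OPTIONS': {'MAX_ENTRIES': 10000},  # large enough that expired files exist by the time culling runs
    },
}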
Taking a step back, what we really need is a persistent and reliable store for large data to access across Celery tasks. We are using Django cache for this, but perhaps it's not the right tool for the job? A cache isn't really meant to be 100% reliable. Is there another approach we should be using to solve the basic problem of passing around large data between Celery tasks?