I have a table containing about 30k records that I'm attempting to iterate over and process with Django's ORM. Each record stores several binary blobs, each of which can be several MB in size, that I need to process and write to a file.
However, I'm having trouble doing this with Django because of memory constraints. I have 8 GB of memory on my system, but after processing about 5k records, the Python process is consuming all 8 GB and gets killed by the Linux kernel. I've tried various tricks for clearing Django's query cache, such as:
MyModel.objects.update()
settings.DEBUG=False
gc.collect()
However, none of these seem to have any noticeable effect, and the process keeps leaking memory until it crashes.
Is there anything else I can do?
Since I only need to process each record once, and I never need to access the same record again later in the run, I have no need to save any model instance or to load more than one instance at a time. How can I ensure that only one record is loaded at a time, that Django caches nothing, and that memory is deallocated immediately after use?
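For reference, the loop looks roughly like this (MyModel, blob_a/blob_b and the output path are simplified stand-ins for my real code):

    from myapp.models import MyModel  # hypothetical app and model names

    def export_blobs(out_path):
        with open(out_path, "wb") as out:
            # Memory keeps growing as this loop runs.
            for record in MyModel.objects.all():
                # Each record holds several multi-MB binary blobs.
                out.write(bytes(record.blob_a))
                out.write(bytes(record.blob_b))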
Try using iterator().
A QuerySet typically caches its results internally so that repeated evaluations do not result in additional queries. In contrast, iterator() will read results directly, without doing any caching at the QuerySet level (internally, the default iterator calls iterator() and caches the return value). For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory.
That's a quote from the Django docs: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator
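As a rough illustration (the model and field names below are made up, not taken from your actual code), your loop could become:

    from myapp.models import MyModel  # hypothetical names, as in the question

    with open("blobs.out", "wb") as out:
        # .iterator() streams rows from the database cursor instead of
        # caching every model instance on the QuerySet, so memory use
        # stays roughly flat across all 30k records.
        for record in MyModel.objects.iterator():
            out.write(bytes(record.blob_a))
            out.write(bytes(record.blob_b))

On newer Django versions (2.0+), iterator() also accepts a chunk_size argument that controls how many rows are fetched from the database at a time.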