We have an app with highly interrelated data, i.e. there are many cases where two objects might refer to the same object via a relationship. As far as I can tell, Django does not make any attempt to return a reference to an already-fetched object if you attempt to fetch it via a different, previously unevaluated relationship.
For example:
from django.db.models import Model, CharField, ForeignKey

class Customer(Model):
    firstName = CharField(max_length=64)
    lastName = CharField(max_length=64)

class Order(Model):
    customer = ForeignKey(Customer, related_name="orders")
Then assume we have a single customer who has two orders in the DB:
order1, order2 = Order.objects.all()
print order1.customer                             # (1) One DB fetch here
print order2.customer                             # (2) Another DB fetch here
print order1.customer == order2.customer          # (3) True, because PKs match
print id(order1.customer) == id(order2.customer)  # (4) False, not the same object
When your data is highly interrelated, traversing relationships between objects triggers repeated queries for the same rows, and the more interrelated the data, the worse this becomes.
We also program for iOS and one of the nice things about CoreData is that it maintains context, so that in a given context there is only ever one instance of a given model. In the example given above, CoreData would not have done the second fetch at (2), because it would have resolved the relationship using the customer already in memory.
Even if line (2) were replaced with a contrived query designed to force another DB fetch, like print Order.objects.exclude( pk = order1.pk ).get( customer = order1.customer )
, CoreData would recognize that the result of that second fetch resolves to a model already in memory and would return the existing model instead of a new one (i.e. (4) would print True in CoreData, because they would actually be the same object).
To hedge against this behaviour of Django, we have ended up writing a lot of awkward code that caches models in memory keyed by their (type, pk)
and then checks relationships via their _id
-suffixed attributes to pull objects from the cache before blindly hitting the DB with another fetch. This is cutting down on DB throughput, but it feels really brittle and likely to break if normal relationship lookups via properties happen in some contrib framework or middleware that we don't control.
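Roughly, the caching layer we are describing amounts to an identity map. Here is a minimal, simplified sketch in pure Python (the names IdentityMap and get_or_fetch are made up, not Django API, and the real code also has to handle invalidation and the _id checks):

```python
# Hypothetical sketch of a (type, pk) identity map; not Django API.

class IdentityMap(object):
    def __init__(self):
        self._cache = {}  # maps (model class, pk) -> instance

    def get_or_fetch(self, cls, pk, fetch):
        """Return the unique in-memory instance for (cls, pk),
        calling fetch(pk) at most once."""
        key = (cls, pk)
        if key not in self._cache:
            self._cache[key] = fetch(pk)
        return self._cache[key]

# Demo with a stand-in model and a counted "DB" fetch:
class Customer(object):
    def __init__(self, pk):
        self.pk = pk

fetch_count = 0

def fetch_customer(pk):
    global fetch_count
    fetch_count += 1
    return Customer(pk)

imap = IdentityMap()
c1 = imap.get_or_fetch(Customer, 1, fetch_customer)
c2 = imap.get_or_fetch(Customer, 1, fetch_customer)
assert c1 is c2          # one object per (type, pk), like CoreData
assert fetch_count == 1  # the second lookup never touched the "DB"
```
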
Are there any best practices or frameworks out there for Django to help avoid this problem? Has anyone attempted to install some kind of thread-local context into Django's ORM to avoid repeat lookups and having multiple in-memory instances mapping to the same DB model?
I know that query-caching packages like JohnnyCache are out there (and they help cut down on DB throughput), but even with those measures in place there is still the issue of multiple instances mapping to the same underlying model.
David Cramer's django-id-mapper is one attempt to do this.
There's a relevant DB optimization page in the Django documentation; basically, callables are not cached, but attributes are (subsequent calls to order1.customer
don't hit the database), though only within their owning object (so the cached Customer is not shared among different orders).
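That per-object caching can be illustrated with a plain-Python descriptor, roughly analogous to (but not actually) Django's related-object descriptor:

```python
# Rough analogue of Django's per-object relation caching: the fetched
# value is stored on the owning instance, so each Order caches its own
# Customer separately.  Illustration only, not Django's real code.

db_hits = 0

class CachedRelation(object):
    """Non-data descriptor: after the first access, the value stored in
    the instance __dict__ shadows the descriptor, so later accesses on
    the same instance never call __get__ again."""
    def __get__(self, obj, objtype=None):
        global db_hits
        if obj is None:
            return self
        db_hits += 1                         # simulate one DB query
        obj.__dict__['customer'] = object()  # stand-in for a Customer row
        return obj.__dict__['customer']

class Order(object):
    customer = CachedRelation()

o1, o2 = Order(), Order()
first = o1.customer    # fetch no. 1
again = o1.customer    # served from o1's cache, no fetch
other = o2.customer    # fetch no. 2: o2 has its own, separate cache
assert first is again
assert db_hits == 2
```
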
using a cache
As you say, one way to solve your problem is to use a database cache. We use Johnny Cache (hosted on Bitbucket), which is almost completely transparent; another good transparent one is Mozilla's Cache Machine. You also have the choice of less-transparent caching systems that might actually fit the bill better; see djangopackages/caching.
Adding a cache can indeed be very beneficial if different requests need to re-use the same Customer; but please read this, which applies to most transparent cache systems, to think through whether your write/read pattern suits such a caching system.
optimizing the requests
Another approach for your precise example is to use select_related:
order1, order2 = Order.objects.all().select_related('customer')
This way the Customer
object is loaded straight away in the same SQL query, at little cost (unless it's a very big record) and with no need to experiment with other packages.