
Efficiency of Django __in lookup for querysets

I have a complex database model set up in Django, and I have to do a number of calculations based on filtered data. I have a Test object, a UserProfile object, and a TestAttempt object (with a foreign key back to a Test and a foreign key back to a UserProfile). There is a method that I run on a TestAttempt that calculates the test score (based on a number of user-supplied choices compared to the correct answers associated with each test), and another method that I run on a Test that calculates the average test score based on each of its associated TestAttempts. But sometimes I only want the average based on a supplied subset of the associated TestAttempts that are linked with a particular set of UserProfiles. So instead of calculating the average test score for a particular test this way:

[x.score() for x in self.test_attempts.all()]

and then averaging these values, I do a query like this:

[x.score() for x in self.test_attempts.filter(profile__id__in=user_id_list).all()]

where user_id_list is a list holding the particular subset of UserProfile ids for which I want to find the average test score. My question is this: if user_id_list is in fact the entire set of UserProfiles (so the filter returns the same rows as self.test_attempts.all()), and most of the time this will be the case, does it pay to check for this case and, if so, skip the filter entirely? Or is the __in lookup efficient enough that, even if user_id_list contains all users, it is more efficient to simply run the filter? Also, do I need to worry about making the resulting test_attempts distinct(), or can duplicates not possibly turn up with the structure of my queryset?

EDIT: For anyone who's interested in looking at the raw SQL query, it looks like this without the filter:

SELECT "mc_grades_testattempt"."id", "mc_grades_testattempt"."date", 
"mc_grades_testattempt"."test_id", "mc_grades_testattempt"."student_id" FROM 
"mc_grades_testattempt" WHERE "mc_grades_testattempt"."test_id" = 1

and this with the filter:

SELECT "mc_grades_testattempt"."id", "mc_grades_testattempt"."date", 
"mc_grades_testattempt"."test_id", "mc_grades_testattempt"."student_id" FROM 
"mc_grades_testattempt" INNER JOIN "mc_grades_userprofile" ON 
("mc_grades_testattempt"."student_id" = "mc_grades_userprofile"."id") WHERE 
("mc_grades_testattempt"."test_id" = 1  AND "mc_grades_userprofile"."user_id" IN (1, 2, 3))

Note that the list (1, 2, 3) is just an example.
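One way to check both concerns (same rows, no duplicates) without Django is to run the two statements above against a throwaway in-memory SQLite database. This is only a sketch: the table and column names come from the SQL in the question, while the sample rows, the profile ids (10, 11, 12) and the `user_ids` list are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mc_grades_userprofile (
        id INTEGER PRIMARY KEY,
        user_id INTEGER UNIQUE
    );
    CREATE TABLE mc_grades_testattempt (
        id INTEGER PRIMARY KEY,
        date TEXT,
        test_id INTEGER,
        student_id INTEGER REFERENCES mc_grades_userprofile (id)
    );
""")
# Made-up sample data: three profiles, four attempts on test 1, one on test 2.
conn.executemany(
    "INSERT INTO mc_grades_userprofile (id, user_id) VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 3)],
)
conn.executemany(
    "INSERT INTO mc_grades_testattempt (id, date, test_id, student_id) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "2012-01-10", 1, 10),
        (2, "2012-01-10", 1, 11),
        (3, "2012-01-11", 1, 12),
        (4, "2012-01-11", 1, 10),  # second attempt by the same profile
        (5, "2012-01-11", 2, 11),  # different test, must never be returned
    ],
)

# Query without the filter.
unfiltered = conn.execute(
    'SELECT "mc_grades_testattempt"."id" FROM "mc_grades_testattempt" '
    'WHERE "mc_grades_testattempt"."test_id" = ? '
    'ORDER BY "mc_grades_testattempt"."id"',
    (1,),
).fetchall()

# Query with the join and the IN filter; user_ids covers every profile.
user_ids = [1, 2, 3]
placeholders = ", ".join("?" for _ in user_ids)
filtered = conn.execute(
    'SELECT "mc_grades_testattempt"."id" FROM "mc_grades_testattempt" '
    'INNER JOIN "mc_grades_userprofile" ON '
    '("mc_grades_testattempt"."student_id" = "mc_grades_userprofile"."id") '
    'WHERE ("mc_grades_testattempt"."test_id" = ? '
    f'AND "mc_grades_userprofile"."user_id" IN ({placeholders})) '
    'ORDER BY "mc_grades_testattempt"."id"',
    (1, *user_ids),
).fetchall()

print(unfiltered == filtered)               # same rows when the list covers everyone
print(len(filtered) == len(set(filtered)))  # no attempt appears twice
```

Whether the extra join costs anything measurable on a real database is a separate matter that this sketch does not measure.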

ecbtln asked Jan 12 '12 01:01



2 Answers

  1. The short answer is: benchmark. Test it in different situations and measure the load; that will give you the best answer.

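  2. There can't be duplicates here: each TestAttempt joins to exactly one UserProfile through its foreign key, so the join cannot multiply rows.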

  3. Is it really a problem to check for two situations? Here's some hypothetical code:

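    def average_score(self, user_id_list=None):
        # Start from every attempt on this test; narrow only when a subset is given.
        qset = self.test_attempts.all()
        if user_id_list is not None:
            qset = qset.filter(profile__id__in=user_id_list)
        scores = [x.score() for x in qset]
        # Return None for an empty result rather than dividing by zero.
        return sum(scores) / len(scores) if scores else None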
    
  4. I don't know what the score method does, but can't you compute the average at the DB level? That would give you a much more noticeable performance boost.

  5. And don't forget about caching.
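
Points 3 and 4 can be combined in a small sketch. It uses plain SQLite rather than Django so that it runs on its own, and it assumes something the question does not say: that each attempt's score is stored in a column. The `attempt` table, its `score` and `profile_id` columns, and the sample rows are all hypothetical. In Django the corresponding step would be an `aggregate()` call with `Avg` on the queryset, again only if the score lives in the database.

```python
import sqlite3


def average_score(conn, test_id, user_id_list=None):
    """Return the average score for a test, or None if no attempts match.

    The IN clause is added only when a subset of profile ids is supplied.
    """
    sql = "SELECT AVG(score) FROM attempt WHERE test_id = ?"
    params = [test_id]
    if user_id_list is not None:
        ids = list(user_id_list)
        if not ids:
            return None  # an empty subset matches nothing; skip the query
        placeholders = ", ".join("?" for _ in ids)
        sql += f" AND profile_id IN ({placeholders})"
        params.extend(ids)
    # AVG over zero rows yields NULL, which sqlite3 maps to None.
    return conn.execute(sql, params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attempt (id INTEGER PRIMARY KEY, test_id INTEGER, "
    "profile_id INTEGER, score REAL)"
)
# Hypothetical sample rows: three attempts on test 1, one on test 2.
conn.executemany(
    "INSERT INTO attempt (test_id, profile_id, score) VALUES (?, ?, ?)",
    [(1, 1, 60.0), (1, 2, 80.0), (1, 3, 100.0), (2, 1, 50.0)],
)

print(average_score(conn, 1))             # all attempts on test 1
print(average_score(conn, 1, [1, 2]))     # only profiles 1 and 2
print(average_score(conn, 1, [1, 2, 3]))  # full list: same result as no filter
```

Passing `None` versus passing the full id list gives the same average; the only difference is whether the IN clause is sent to the database at all.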

DrTyrsa answered Sep 20 '22 09:09



From what I understand of the documentation, all queries are built before they are actually used. So, for instance, test_attempts.all() only generates the SQL; the query runs against the database when you actually need the data, by doing something like .count(), for t in test_attempts.all():, etc. (get() is the exception: it runs immediately and returns a single object rather than a QuerySet.) With that in mind, the number of calls to the database would be exactly the same, while the actual call would be different.

As you show in your edited post, the raw queries are different, but they are both generated in the same way, before the data is accessed by Django. From a Django perspective, they would both be created in the same fashion and then executed on the database.

In my opinion, it would be best not to test for an all() situation, as you would have to run TWO queries to determine that. I believe you should run with the code you have and skip checking for the all() scenario, which you describe as the most common case. Most modern database engines plan their queries so that an added join like this one rarely hurts performance much, since they choose an efficient execution order anyway.
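
The idea that building a query and running it are separate steps can be illustrated with a toy class. This is not Django's QuerySet, only a hypothetical `LazyQuery` stand-in over SQLite (the `attempt` table and its columns are also made up): chaining a filter just records a condition, and the database is hit once, when the object is iterated.

```python
import sqlite3


class LazyQuery:
    """Toy stand-in for a lazy queryset; NOT Django's implementation."""

    executed = []  # SQL strings actually sent to the database

    def __init__(self, conn, table, conditions=(), params=()):
        self.conn = conn
        self.table = table
        self.conditions = tuple(conditions)
        self.params = tuple(params)

    def filter_in(self, column, values):
        # Return a new object with one more condition; nothing is executed here.
        values = tuple(values)
        placeholders = ", ".join("?" for _ in values)
        return LazyQuery(
            self.conn,
            self.table,
            self.conditions + (f"{column} IN ({placeholders})",),
            self.params + values,
        )

    def __iter__(self):
        # The query runs only now, when the rows are needed.
        sql = f"SELECT id FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        sql += " ORDER BY id"
        LazyQuery.executed.append(sql)
        return iter(self.conn.execute(sql, self.params).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attempt (id INTEGER PRIMARY KEY, profile_id INTEGER)")
conn.executemany("INSERT INTO attempt (profile_id) VALUES (?)", [(1,), (2,), (3,)])

all_attempts = LazyQuery(conn, "attempt")
subset = all_attempts.filter_in("profile_id", [1, 2, 3])
print(len(LazyQuery.executed))  # nothing has run yet

rows_all = list(all_attempts)
rows_subset = list(subset)
print(len(LazyQuery.executed))  # one query per evaluated object
print(rows_all == rows_subset)  # the full id list selects the same rows
```

Either way a single query is issued per evaluation; what changes between the filtered and unfiltered versions is only the SQL text that is sent.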

Furbeenator answered Sep 17 '22 09:09
