Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

django many-to-many field: prefetch primary keys only

I'm trying to optimize database queries for a Django app. Here's a simplified example:

class Label(models.Model):
    name = models.CharField(max_length=200)
    # ... many other fields ...

class Thing(models.Model):
    name = models.CharField(max_length=200)
    labels = models.ManyToManyField(Label)

I have a function that fetches all Labels and Things and puts them into a JSON data structure, in which Things refer to Labels using their ids (primary keys). Something like this:

{
    'labels': [
        { 'id': 123, 'name': 'label foo' },
        ...
    ],
    'things': [
        { 'id': 45, 'name': 'thing bar', 'labels': [ 123, ... ] },
        ...
    ]
}

What is the most efficient way of obtaining such a data structure using Django? Suppose I have L Labels and T Things, and the average Thing has x Labels.

Method 1:

data = {}
data['labels'] = [model_to_dict(label) for label in Label.objects.all()]
data['things'] = [model_to_dict(thing) for thing in Thing.objects.all()]

This makes (1 + 1 + T) database queries, since model_to_dict(thing) needs to fetch the Labels for each Thing individually.

Method 2:

data = {}
data['labels'] = [model_to_dict(label) for label in Label.objects.all()]
data['things'] = [model_to_dict(thing) for thing in
                    Thing.objects.prefetch_related('labels').all()]

This makes (1 + 1 + 1) database queries only, since the Things fetched now have their Labels prefetched in a single additional query.

This is still not satisfactory. prefetch_related('labels') will fetch many copies of the same Label, whereas I only need their ids. Is there any way to prefetch the ids of the Labels only? I tried prefetch_related('labels__id') but that didn't work. I am also concerned that because T is large (hundreds), prefetch_related('labels') results in a SQL query with a large IN clause. L is much smaller (< 10), so I could do this instead:

Method 3:

data = {}
data['labels'] = [model_to_dict(label) for label in
                    Label.objects.prefetch_related('thing_set').all()]
things = list(Thing.objects.all())
# plug in label ids by hand, and also fetch things that have zero labels
# somehow

This results in a smaller IN clause, but is still not satisfactory because prefetch_related('thing_set') fetches duplicate Things, if a Thing has multiple Labels.

Summary:

Label and Thing are connected by a ManyToManyField. I am fetching all Labels and Things anyway. So how do I also fetch their many-to-many relationship efficiently?

like image 360
cberzan Avatar asked Apr 23 '12 01:04

cberzan


1 Answers

I got it. Thanks to ilvar, whose comment to the question pointed me to through tables.

If you don’t specify an explicit through model, there is still an implicit through model class you can use to directly access the table created to hold the association. It has three fields to link the models.

Long story short:

# Fetch all labels and things:
labels = list(Label.objects.all())
things = list(Thing.objects.all())
# Fetch all label-thing pairs:
labels_of = defaultdict(lambda: [])
for pair in Thing.labels.through.objects.filter(label__in=labels):
    labels_of[pair.thing_id].append(pair.label_id)
# Put everything together:
data = {}
data['labels'] = [model_to_dict(label) for label in labels]
data['things'] = []
for thing in things:
    thing_dict = model_to_dict(thing, exclude='labels')
    thing_dict['labels'] = labels_of[thing.id]
    data['things'].append(thing_dict)

This makes (1 + 1 + 1) queries, and does not fetch anything repeatedly. I can also change the first for loop to:

for pair in Thing.labels.through.objects.filter(thing__in=things):

in case I have more Labels than Things, which will result in a query with a smaller IN clause.

Django-debug-toolbar's debugsqlshell management command is superb for actually seeing the queries that a piece of code is making.

like image 56
cberzan Avatar answered Nov 13 '22 05:11

cberzan