
django rest framework - backward serialization to avoid prefetch_related

I have two models, Item and ItemGroup:

class ItemGroup(models.Model):
   group_name = models.CharField(max_length=50)
   # fields..

class Item(models.Model):
   item_name = models.CharField(max_length=50)
   item_group = models.ForeignKey(ItemGroup, on_delete=models.CASCADE)
   # other fields..

I want to write a serializer that will fetch all item groups with their item list as a nested array.

So I want this output:

[ {"group_name": "item group name", "items": [... list of items ...] }, ... ]

As I understand it, this is how I should write it with Django REST framework:

class ItemGroupSerializer(serializers.ModelSerializer):
   class Meta:
      model = ItemGroup
      fields = ('item_set', 'group_name') 

That means I have to write a serializer for ItemGroup (not for Item). To avoid many queries I pass this queryset:

ItemGroup.objects.filter(**filters).prefetch_related('item_set')

The problem I see is that, for a large dataset, prefetch_related issues an extra query with a VERY large SQL IN clause, which I could avoid by querying the Item objects instead:

Item.objects.filter(**filters).select_related('item_group')

This results in a JOIN, which is much better.

Is it possible to query Item instead of ItemGroup, and yet to have the same serialization output?

asked Dec 20 '18 18:12 by user3599803


2 Answers

With prefetch_related you get two queries plus the large IN clause, although the approach is proven and portable.

I'll give a solution that is more of an example, based on your field names. It serializes the Item rows using your select_related queryset, then overrides the view's list method to transform that serializer's data into a second serializer that gives you the representation you want. It uses only one query, and regrouping the results is O(n), so it should be fast. Note that get_data only merges consecutive rows, so the queryset needs to be ordered by group (for example with order_by('item_group')); otherwise the same group can appear more than once in the output.

You might need to refactor get_data in order to add more fields to your results.

class ItemSerializer(serializers.ModelSerializer):
    group_name = serializers.CharField(source='item_group.group_name')

    class Meta:
        model = Item
        fields = ('item_name', 'group_name')

class ItemGSerializer(serializers.Serializer):
    group_name = serializers.CharField(max_length=50)
    items = serializers.ListField(child=serializers.CharField(max_length=50))

In the view:

class ItemGroupViewSet(viewsets.ModelViewSet):
    model = models.Item
    serializer_class = serializers.ItemSerializer
    queryset = models.Item.objects.select_related('item_group').all()

    def list(self, request, *args, **kwargs):
        queryset = self.filter_queryset(self.get_queryset())

        page = self.paginate_queryset(queryset)
        if page is not None:
            serializer = self.get_serializer(page, many=True)
            data = self.get_data(serializer.data)
            s = serializers.ItemGSerializer(data, many=True)
            return self.get_paginated_response(s.data)

        serializer = self.get_serializer(queryset, many=True)
        data = self.get_data(serializer.data)
        s = serializers.ItemGSerializer(data, many=True)
        return Response(s.data)

    @staticmethod
    def get_data(data):
        result, current_group = [], None
        for elem in data:
            if current_group is None:
                current_group = {'group_name': elem['group_name'], 'items': [elem['item_name']]}
            else:
                if elem['group_name'] == current_group['group_name']:
                    current_group['items'].append(elem['item_name'])
                else:
                    result.append(current_group)
                    current_group = {'group_name': elem['group_name'], 'items': [elem['item_name']]}

        if current_group is not None:
            result.append(current_group)
        return result

Here is my result with my fake data:

[{
    "group_name": "group #2",
    "items": [
        "first item",
        "2 item",
        "3 item"
    ]
},
{
    "group_name": "group #1",
    "items": [
        "g1 #1",
        "g1 #2",
        "g1 #3"
    ]
}]
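
The get_data method above merges only adjacent rows, so it relies on the rows arriving grouped. As a minimal sketch of an alternative, assuming the same flat rows with 'group_name' and 'item_name' keys, the grouping can be made independent of row order with a dict. The name group_rows is hypothetical and not part of the answer's code:

```python
def group_rows(rows):
    """Group flat {'group_name', 'item_name'} rows into nested dicts.

    Rows of the same group do not have to be adjacent. Groups are emitted
    in the order in which each group name is first seen.
    """
    grouped = {}  # plain dicts preserve insertion order (Python 3.7+)
    for row in rows:
        grouped.setdefault(row["group_name"], []).append(row["item_name"])
    return [
        {"group_name": name, "items": items}
        for name, items in grouped.items()
    ]


# Illustrative rows, deliberately interleaved between two groups.
rows = [
    {"item_name": "first item", "group_name": "group #2"},
    {"item_name": "g1 #1", "group_name": "group #1"},
    {"item_name": "2 item", "group_name": "group #2"},
]
result = group_rows(rows)
```

Like get_data, this keys on group_name, so two distinct groups that share a name would be merged. Exposing the group id in the Item serializer and keying on that would avoid the ambiguity.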
answered Oct 08 '22 16:10 by edilio


Let's start off with the basics

A serializer can only work with the data it is given

So this means that in order to get a serializer which can serialize a list of ItemGroup and Item objects in a nested representation, it has to be given that list in the first place. You've accomplished that so far using a query on the ItemGroup model that calls prefetch_related to get the related Item objects. You've also identified that prefetch_related triggers a second query to get those related objects, and this isn't satisfactory.

prefetch_related is used to get multiple related objects

What does this mean exactly? When you are querying for a single object, like a single ItemGroup, you use prefetch_related to get a relationship containing multiple related objects, like a reverse foreign key (one-to-many) or a many-to-many relationship that's been defined. Django intentionally uses a second query to get these objects for a few reasons:

  1. The join that would be required in a select_related is often non-performant when you force it to do a join against a second table. This is because a right outer join would be required in order to ensure that no ItemGroup objects that do not contain an Item are missed.
  2. The query used by prefetch_related is an IN on an indexed field (for a reverse foreign key, the foreign key column that holds the parents' primary keys), which is one of the most performant queries out there.
  3. The query only requests the IDs of Item objects it knows exist, so it can efficiently handle duplicates (in the case of many-to-many relationships) without having to do an additional subquery.

All of this is a way to say: prefetch_related is doing exactly what it should do, and it's doing that for a reason.
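
To make the two strategies concrete, here is a rough sketch using the standard-library sqlite3 module rather than Django. The table and column names are assumptions chosen for illustration, and the SQL only approximates the shape of the queries, not what Django actually emits:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_group (id INTEGER PRIMARY KEY, group_name TEXT);
    CREATE TABLE item (
        id INTEGER PRIMARY KEY,
        item_name TEXT,
        item_group_id INTEGER REFERENCES item_group(id)
    );
    INSERT INTO item_group VALUES
        (1, 'group #1'), (2, 'group #2'), (3, 'empty group');
    INSERT INTO item VALUES
        (1, 'g1 #1', 1), (2, 'g1 #2', 1), (3, 'first item', 2);
""")

# Strategy A (prefetch-style): query 1 fetches the parent rows.
groups = conn.execute(
    "SELECT id, group_name FROM item_group ORDER BY id"
).fetchall()

# Query 2 fetches every child of those parents with a single IN (...).
group_ids = [group_id for group_id, _ in groups]
placeholders = ", ".join("?" for _ in group_ids)
children = conn.execute(
    "SELECT item_group_id, item_name FROM item "
    f"WHERE item_group_id IN ({placeholders}) ORDER BY id",
    group_ids,
).fetchall()

# The "join" between parents and children happens in Python.
items_by_group = defaultdict(list)
for group_id, item_name in children:
    items_by_group[group_id].append(item_name)

nested = [
    {"group_name": name, "items": items_by_group.get(group_id, [])}
    for group_id, name in groups
]

# Strategy B (select_related-style): one inner JOIN from the child side.
joined = conn.execute(
    "SELECT g.group_name, i.item_name FROM item AS i "
    "JOIN item_group AS g ON g.id = i.item_group_id ORDER BY i.id"
).fetchall()
```

In this sketch the group without items shows up in nested with an empty list, but it is absent from joined, which is the gap an outer join would have to cover.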

But I want to do this with a select_related anyway

Alright, alright. That's what was asked for, so let's see what can be done.

There are a few ways to accomplish this, all of which have their pros and cons and none of which work without some manual "stitching" work in the end. I am making the assumption that you aren't using the built-in ViewSet or generic views provided by DRF, but if you are then the stitching must happen in the filter_queryset method to allow the built-in filtering to work. Oh, and it probably breaks pagination or makes it almost useless.

Preserving the original filters

The original set of filters are being applied to the ItemGroup object. And since this is being used in an API, these are probably dynamic and you don't want to lose them. So, you are going to need to apply filters through one of two ways:

  1. Generate the filters and then prefix them with the related name

    So you would generate your normal foo=bar filters and then prefix them before passing it to filter() so it'd be related__foo=bar. This may have some performance implications since you're now filtering across relationships.

  2. Generate the original subquery and then pass it to the Item query directly

    This is probably the "cleanest" solution, except that you're generating an IN query comparable to the prefetch_related one, and it will likely perform worse since it is treated as an uncacheable subquery instead.

Implementing either of these is realistically out of the scope of this question, since we want to be able to "flip and stitch" the Item and ItemGroup objects so the serializer works.
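
The key-prefixing step of the first option is small enough to sketch, though. This assumes the filters are a plain dict of keyword lookups (no Q objects) and that the relation from Item to ItemGroup is named item_group, as in the question's models. The helper name prefix_filters is hypothetical:

```python
def prefix_filters(filters, relation):
    """Return a copy of `filters` with every lookup routed through `relation`.

    {'foo': 'bar'} with relation 'related' becomes {'related__foo': 'bar'}.
    """
    return {
        f"{relation}__{lookup}": value
        for lookup, value in filters.items()
    }


# Illustrative filters originally written against ItemGroup.
group_filters = {"group_name__startswith": "group", "id__in": [1, 2]}
item_filters = prefix_filters(group_filters, "item_group")
```

The resulting dict would then be passed to the Item query in place of the original one, as in Item.objects.filter(**item_filters).select_related('item_group').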

Flipping the Item query so you get a list of ItemGroup objects

Taking the query given in the original question, where select_related is being used to grab all of the ItemGroup objects alongside the Item objects, you are returned a queryset full of Item objects. We actually want a list of ItemGroup objects, since we're working with an ItemGroupSerializer, so we're going to have to "flip it" around.

from collections import defaultdict

items = Item.objects.filter(**filters).select_related('item_group')

item_group_to_items = defaultdict(list)
item_groups_by_id = {}

for item in items:
    item_group = item.item_group

    item_groups_by_id[item_group.id] = item_group
    item_group_to_items[item_group.id].append(item)

I am intentionally using the id of the ItemGroup as the key for the dictionaries since most Django models are not immutable, and sometimes people override the hashing method to be something other than the primary key.

This will get you a mapping of ItemGroup objects to their related Item objects, which is ultimately what you need in order to "stitch" them together again.

Stitching the ItemGroup objects back with their related Item objects

This part isn't actually difficult to do, since you have all of the related objects already.

for item_group_id, item_group_items in item_group_to_items.items():
    item_group = item_groups_by_id[item_group_id]

    item_group.item_set = item_group_items

item_groups = list(item_groups_by_id.values())

This gives you all of the requested ItemGroup objects that have at least one matching Item, stored as a list in the item_groups variable. Each ItemGroup object will have the list of related Item objects set in the item_set attribute. Because item_set is also the name of the automatically generated reverse foreign key manager, and Django 2.0 and later raise a TypeError on direct assignment to it, you will likely need to store the list under a different attribute name and point the serializer field at that name instead.

From here, you can use it as you normally would in your ItemGroupSerializer and it should work for serialization.

Bonus: A generic way to "flip and stitch"

You can make this generic (and unreadable) pretty quickly, for use in other similar scenarios:

def flip_and_stitch(items, group_from_item, store_in):
    from collections import defaultdict

    item_group_to_items = defaultdict(list)
    item_groups_by_id = {}

    # Flip: collect each group and the items that point at it
    for item in items:
        item_group = getattr(item, group_from_item)

        item_groups_by_id[item_group.id] = item_group
        item_group_to_items[item_group.id].append(item)

    # Stitch: attach each list of items to its group
    for item_group_id, item_group_items in item_group_to_items.items():
        item_group = item_groups_by_id[item_group_id]

        setattr(item_group, store_in, item_group_items)

    return list(item_groups_by_id.values())

And you'd just call this as

item_groups = flip_and_stitch(items, 'item_group', 'item_set')

Where:

  • items is the queryset of items that you requested originally, with the select_related call already applied.
  • item_group is the attribute on the Item object where the related ItemGroup is stored.
  • item_set is the attribute on the ItemGroup object where the list of related Item objects should be stored.
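
To see the flip and stitch run end to end without a database, here is a small sketch that uses types.SimpleNamespace objects as stand-ins for model instances. The stand-ins, the condensed variable names and the attribute name stitched_items are assumptions for illustration. Each stand-in item carries its own copy of its group, which roughly mimics what select_related hands back:

```python
from collections import defaultdict
from types import SimpleNamespace


def flip_and_stitch(items, group_from_item, store_in):
    items_by_group_id = defaultdict(list)
    groups_by_id = {}

    # Flip: remember one group object per id and collect its items.
    for item in items:
        group = getattr(item, group_from_item)
        groups_by_id[group.id] = group
        items_by_group_id[group.id].append(item)

    # Stitch: attach each collected list to the remembered group object.
    for group_id, group_items in items_by_group_id.items():
        setattr(groups_by_id[group_id], store_in, group_items)

    return list(groups_by_id.values())


def make_item(item_name, group_id, group_name):
    # Stand-in for an Item row loaded with its related ItemGroup.
    group = SimpleNamespace(id=group_id, group_name=group_name)
    return SimpleNamespace(item_name=item_name, item_group=group)


items = [
    make_item("first item", 2, "group #2"),
    make_item("g1 #1", 1, "group #1"),
    make_item("2 item", 2, "group #2"),
]

groups = flip_and_stitch(items, "item_group", "stitched_items")
output = [
    {
        "group_name": group.group_name,
        "items": [item.item_name for item in group.stitched_items],
    }
    for group in groups
]
```

Storing the list under a name other than item_set sidesteps the clash with the reverse foreign key manager on a real model. A serializer field would then need to read from that attribute.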
answered Oct 08 '22 18:10 by Kevin Brown-Silva