
random.sample on Django querysets: How will sampling on querysets affect performance?

Tags: python, django

To improve performance, I tried sampling just a few records from my queryset, like this:

from random import sample
from my_app import MyModel


my_models = MyModel.objects.all()

# sample only a few of records for performance
my_models_sample = sample(my_models, 5)

for model in my_models_sample:
    model.some_expensive_calculation

But it seemed to make the execution time worse, not better.

How does random.sample() actually work behind the scenes? And is it a performance burden when used on Django querysets?

June asked Aug 04 '15 06:08




3 Answers

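Because random.sample() forces the my_models queryset to be evaluated, every MyModel row is fetched and turned into an object before any sampling happens. The execution time of your program will therefore depend heavily on the total number of MyModel objects in your database, not on the 5 you keep.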
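To illustrate the point, here is a minimal sketch in plain Python with no Django involved. FakeQuerySet is a hypothetical stand-in, not Django's QuerySet: like a real queryset, it loads every row the first time its length is requested. random.sample() needs the population size before it can pick anything, so all rows get loaded even though only 5 are kept.

```python
import random
from collections.abc import Sequence


class FakeQuerySet(Sequence):
    """Hypothetical stand-in for a lazy queryset (not Django's QuerySet)."""

    def __init__(self, total_rows):
        self.total_rows = total_rows
        self._cache = None      # filled on first evaluation
        self.rows_loaded = 0    # counts how many rows were "fetched"

    def _evaluate(self):
        # Simulates running SELECT * and building one object per row.
        if self._cache is None:
            self._cache = list(range(self.total_rows))
            self.rows_loaded = len(self._cache)

    def __len__(self):
        self._evaluate()
        return len(self._cache)

    def __getitem__(self, index):
        self._evaluate()
        return self._cache[index]


fake_models = FakeQuerySet(100_000)
picked = random.sample(fake_models, 5)

print(len(picked))               # 5 items kept
print(fake_models.rows_loaded)   # 100000 rows loaded to pick them
```

The counter shows the cost: the sample size is 5, but the work done is proportional to the size of the whole table.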

To avoid loading the entire queryset into memory, you could implement your own sampling function, as described here, and combine it with the .iterator() method.

Alternatively, you can let the database server do the sampling for you via order_by('?'):
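One possible shape for such a function is reservoir sampling (Algorithm R). This is an assumption about what a custom sampler might look like, not necessarily the approach the link above describes, and reservoir_sample is a hypothetical helper name. It keeps only k items in memory while consuming any iterator, which is what makes it a natural fit for .iterator().

```python
import random


def reservoir_sample(iterable, k):
    """Return up to k items chosen uniformly at random from iterable.

    Uses reservoir sampling (Algorithm R): only k items are kept in
    memory, no matter how many items the iterable yields.
    """
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1),
            # overwriting a randomly chosen slot.
            slot = random.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Stand-in for a streamed queryset; with Django this might be
# reservoir_sample(MyModel.objects.iterator(), 5).
rows = (n for n in range(10_000))
sample_rows = reservoir_sample(rows, 5)
print(len(sample_rows))
```

Note that this still reads every row from the database once, so it saves memory rather than database work.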

MyModel.objects.order_by('?')[:5]

Personally, I wouldn't recommend the latter, as order_by('?') queries may be expensive and slow depending on the database backend you're using (especially MySQL).

Ozgur Vatansever answered Nov 04 '22 03:11


Why not let the database do the shuffling and limiting and compare the times?

MyModel.objects.order_by('?')[:5]

Although the documentation states that this may be expensive, you are currently fetching all the rows anyway, so I suspect there will be a difference. The magnitude of the difference will depend on how big the data set is (and, of course, on your database backend).
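A rough way to run that comparison is sketched below, using only the standard library. It is an illustration under stated assumptions rather than a benchmark of your setup: the my_model table, its columns, and both helper functions are hypothetical, it uses an in-memory SQLite database instead of the Django ORM, and the exact SQL Django generates for order_by('?') depends on the backend.

```python
import random
import sqlite3
import time

# Hypothetical table standing in for MyModel's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_model (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO my_model (payload) VALUES (?)",
    (("row %d" % n,) for n in range(50_000)),
)


def sample_in_python(k):
    # Fetch every row, then sample client-side
    # (comparable to random.sample on a full queryset).
    rows = conn.execute("SELECT id, payload FROM my_model").fetchall()
    return random.sample(rows, k)


def sample_in_database(k):
    # Let the database shuffle and limit
    # (the idea behind order_by('?')[:k]).
    query = "SELECT id, payload FROM my_model ORDER BY RANDOM() LIMIT ?"
    return conn.execute(query, (k,)).fetchall()


for func in (sample_in_python, sample_in_database):
    start = time.perf_counter()
    result = func(5)
    elapsed = time.perf_counter() - start
    print(func.__name__, len(result), "rows in", round(elapsed, 4), "s")
```

The numbers from an in-memory SQLite table will not match a production server, so it is worth timing both variants against your real database and data size.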

Burhan Khalid answered Nov 04 '22 03:11


You are using random.sample() on a QuerySet object.

If you actually want 5 random objects as a QuerySet, you can use this instead:

random_objects = MyModel.objects.all().order_by('?')[:5]

This returns 5 random objects and should reduce the sampling time, since only 5 rows are fetched from the database.

PS: I will also look into why random.sample() takes so long for that operation and report back if, of course, I find something. :)

Arpit Goyal answered Nov 04 '22 03:11