
Speed up a Django nested for loop with a time series

I am working on a Django-based open-source project called OpenREM (http://demo.openrem.org/openrem/, http://openrem.org).

To calculate the data for one of the plots, I run a series of queries to count the items that fall into each of the 24 hours of each day of the week. This data drives the pie chart of studies per weekday on the CT page of the demo site, with a drill-down to studies per hour for that day:

# 7 weekdays x 24 hours, initialised to zero
studiesPerHourInWeekdays = [[0 for _ in range(24)] for _ in range(7)]
for day in range(7):
    # Django's week_day lookup runs from 1 (Sunday) to 7 (Saturday)
    studyTimesOnThisWeekday = f.qs.filter(study_date__week_day=day+1).values('study_time')
    if studyTimesOnThisWeekday:
        for hour in range(24):
            try:
                # One COUNT query per hour: up to 7 * 24 = 168 COUNT queries in total
                studiesPerHourInWeekdays[day][hour] = studyTimesOnThisWeekday.filter(study_time__gte=str(hour)+':00').filter(study_time__lte=str(hour)+':59').count()
            except Exception:
                studiesPerHourInWeekdays[day][hour] = 0

This takes a little while to run on a production system. I think the inner for loop could be removed by using a qsstats-magic time_series, aggregated over hours. Unfortunately, there isn't a suitable datetime object stored in the database that I can use for this.

Does anyone know how I can combine the "study_date" datetime.date object and "study_time" datetime.time object into a single datetime.datetime object for me to be able to run a qsstats-magic time_series by hour?

Thanks,

David

David asked Nov 09 '22 13:11


1 Answer

If at all possible (though your circumstances suggest it isn't), the best fix is to change the database schema to match the kinds of queries you're making: for example, a datetime field that holds this information, or some type of foreign key setup.
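For reference, the value such a combined field would hold can be built in plain Python with datetime.datetime.combine. This is only a minimal sketch: the helper name combine_study_datetime is hypothetical, and the result is a naive datetime, so no timezone handling is assumed.

```python
import datetime


def combine_study_datetime(study_date, study_time):
    """Merge a datetime.date and a datetime.time into one naive datetime.

    Hypothetical helper for illustration; not part of OpenREM or Django.
    """
    return datetime.datetime.combine(study_date, study_time)


# Example values only; they are not taken from any real study.
combined = combine_study_datetime(datetime.date(2022, 11, 9), datetime.time(13, 11))
print(combined.isoformat())
```

A value like this could then be truncated to the hour or grouped on directly, which is what an hourly time series needs.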

You probably already know that, though, so the practical answer to your question is to use the underlying database tools to your advantage through an extra() call. Maybe something like this* if you're using PostgreSQL:

from django.db.models import Count

date_hour_set = f.qs.extra(
    select={
        # Truncate each study to the start of its hour, as a timestamp
        'date_hour': "study_date + interval '1h' * date_part('hour', study_time)",
    }).values('date_hour').annotate(date_hour_count=Count('pk')).order_by()

which would give you a queryset of datetimes (hours only) with their associated occurrence counts. Handwritten SQL is the easiest option at the moment because of Django's lagging TimeField support, and it will probably be the most performant, too.
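To illustrate the general idea of letting the database do the grouping, here is a small self-contained sketch that uses Python's built-in sqlite3 module rather than PostgreSQL or the Django ORM. The table name study, its text columns, and the helper studies_per_weekday_hour are all assumptions made up for this example; the single GROUP BY query is the only point being made.

```python
import sqlite3


def studies_per_weekday_hour(rows):
    """Count (study_date, study_time) rows per weekday and hour in one query.

    rows: iterable of ('YYYY-MM-DD', 'HH:MM:SS') string pairs.
    Returns a 7 x 24 grid indexed [weekday][hour], with Sunday as weekday 0.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in table; a real schema will differ.
    conn.execute("CREATE TABLE study (study_date TEXT, study_time TEXT)")
    conn.executemany("INSERT INTO study VALUES (?, ?)", rows)

    grid = [[0] * 24 for _ in range(7)]
    # SQLite's strftime('%w') yields 0 (Sunday) to 6 (Saturday).
    query = (
        "SELECT CAST(strftime('%w', study_date) AS INTEGER) AS weekday, "
        "CAST(strftime('%H', study_time) AS INTEGER) AS hour, COUNT(*) "
        "FROM study GROUP BY weekday, hour"
    )
    for weekday, hour, count in conn.execute(query):
        grid[weekday][hour] = count
    conn.close()
    return grid


# Made-up sample data for the demonstration.
sample = [
    ("2022-11-09", "13:11:00"),  # a Wednesday
    ("2022-11-09", "13:45:30"),
    ("2022-11-13", "08:05:00"),  # a Sunday
]
grid = studies_per_weekday_hour(sample)
```

The grid has the same shape as the one in the question, but it is filled from one grouped query instead of one COUNT per weekday and hour.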

*Note I don't write SQL regularly and am being lazy, so there are cleaner ways to work this.

If you really need to be database portable and still can't edit the schema, you can stack together features of Django aggregation, although the result is a little convoluted:

from django.db.models import Value, Count, ExpressionWrapper, CharField
from django.db.models.functions import Substr, Concat

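# Build an 'HH:00:00' string from the first two characters of study_time
hour_counts = f.qs.annotate(hour=Concat(Substr('study_time', 1, 2), Value(':00:00')))
# Join date and hour into one string to group on
date_hour_pairs = hour_counts.annotate(
    date_hour=ExpressionWrapper(
        Concat('study_date', 'hour'),
        output_field=CharField(),
    )
).values('study_date', 'hour', 'date_hour')
# Count the rows in each (date, hour) group
date_hour_counts = date_hour_pairs.annotate(count=Count('date_hour')).distinct()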

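which should give you a set of dicts with the hour under 'hour' (as far as I can tell this comes back as a string such as '09:00:00', since it is built with Concat, rather than as a datetime.time object), the datetime.date you started with under 'study_date', a concatenated string version of the date and time under 'date_hour', and then the all-important (date, hour) count under 'count'.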
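If the goal is still the 7 x 24 grid from the question, rows shaped like the ones described above could be folded into it in Python. This is a rough sketch, assuming each row is a dict with 'study_date', 'hour' and 'count' keys and that 'hour' starts with a two-digit hour; the function name fold_counts and the sample rows are hypothetical.

```python
import datetime


def fold_counts(rows):
    """Sum per-(date, hour) counts into a [weekday][hour] grid.

    Weekday 0 is Sunday, matching Django's week_day lookup minus one.
    """
    grid = [[0] * 24 for _ in range(7)]
    for row in rows:
        # isoweekday() is 1 (Monday) to 7 (Sunday); % 7 maps Sunday to 0.
        day = row['study_date'].isoweekday() % 7
        # Accept either an 'HH:MM:SS' string or a datetime.time for the hour.
        hour = int(str(row['hour'])[:2])
        grid[day][hour] += row['count']
    return grid


# Made-up rows standing in for the queryset results.
rows = [
    {'study_date': datetime.date(2022, 11, 9), 'hour': '13:00:00', 'count': 2},
    {'study_date': datetime.date(2022, 11, 16), 'hour': '13:00:00', 'count': 3},
    {'study_date': datetime.date(2022, 11, 13), 'hour': datetime.time(8, 0), 'count': 1},
]
weekday_hour_grid = fold_counts(rows)
```

Because the counts arrive per calendar date, several dates that fall on the same weekday are added together here rather than overwritten.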

R Phillip Castagna answered Nov 14 '22 23:11