Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize query to compute row-dependent datetime relationships?

Say I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:

Find all patients that were given a medication after an operation and 
the operation happened after an admission. 

Where medication, operation and admission are all types of event categories. There are ~100 possible categories.

I'm expecting 1000s of patients and every patient has ~10 events per category.

The naive solution I came up with was to have two tables, a patient and an event table. Create an index on event.category and then query using inner-joins like:

SELECT COUNT(DISTINCT(patient.id)) FROM patient
INNER JOIN event AS medication
    ON  medication.patient_id = patient.id
    AND medication.category = 'medication'
INNER JOIN event AS operation
    ON  operation.patient_id = patient.id
    AND operation.category = 'operation'
INNER JOIN event AS admission
    ON  admission.patient_id = patient.id
    AND admission.category = 'admission'
WHERE medication.date > operation.date
    AND operation.date > admission.date;

However this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour:

| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2                     | 100ms                  |
| 3                     | 500ms                  |
| 4                     | 2000ms                 |
| 5                     | 8000ms                 | 

Explain: explain

Does anyone have any suggestions on how to optimize this query/data model?

Extra info:

  • Postgres 10.6
  • In the Explain output, project_result is equivalent to patient in the simplified model.

Advanced use case:

Find all patients that were given a medication within 30 days after an 
operation and the operation happened within 7 days after an admission.
like image 762
shusson Avatar asked Jan 02 '19 15:01

shusson


2 Answers

First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:

SELECT COUNT(DISTINCT patient)  -- still not optimal
FROM   event a
JOIN   event o USING (patient_id)
JOIN   event m USING (patient_id)
WHERE  a.category = 'admission'
AND    o.category = 'operation'
AND    m.category = 'medication'
AND    m.date > o.date
AND    o.date > a.date;

Next, get rid of the repeated multiplication of rows and the DISTINCT to counter that in the outer SELECT by using EXISTS semi-joins instead:

SELECT COUNT(*)
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   )
AND    a.category = 'admission';

Note, there can still be duplicates in the admission, but that's probably a principal problem in your data model / query design, and would need clarification as discussed in the comments.

If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):

SELECT count(*)
FROM   patient p
CROSS  JOIN LATERAL ( -- get earliest admission
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id 
   AND    e.category = 'admission'
   ORDER  BY e.date
   LIMIT  1
   ) a
CROSS  JOIN LATERAL ( -- get earliest operation after that
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id 
   AND    e.category = 'operation'
   AND    e.date > a.date
   ORDER  BY e.date
   LIMIT  1
   ) o
WHERE EXISTS (  -- the *last* step can still be a plain EXISTS
      SELECT FROM event m
      WHERE  m.patient_id = p.id
      AND    m.category = 'medication'
      AND    m.date > o.date
      );

See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest record per user

You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer (or even int2 or "char" value as FK.)

For best performance (and this is crucial) have a multicolumn index on (parent_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here. Postgres can use the index with default ASC sort order almost as efficiently in your case.

If VACUUM (preferably in the form of autovacuum) can keep up with write operations or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.

Related:

  • Optimizing queries on a range of timestamps (two columns)
  • Select Items that has one item but not the other
  • How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?

To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.

You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.

SELECT COUNT(*)                    -- to count cases
   --  COUNT(DISTINCT patient_id)  -- to count patients
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date      -- or ">"
   AND    o.date <  a.date + 7  -- based on data type "date"!
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date       -- or ">"
      AND    m.date <  o.date + 30  -- syntax for timestamp is different
      )
   )
AND    a.category = 'admission';

About date / timestamp arithmetic:

  • How to get the end of a day?
like image 139
Erwin Brandstetter Avatar answered Nov 14 '22 20:11

Erwin Brandstetter


You might find that conditional aggregation does what you want. The time component can be difficult to handle (see below), if your sequences get complicated, but the basic idea:

select e.patient_id
from events e
group by e.patient_id
having (max(date) filter (where e.category = 'medication') > 
        min(e.date) filter (where e.category = 'operation')
       ) and
       (min(date) filter (where e.category = 'operation') >
        min(e.date) filter (where e.category = 'admission'
       );

This can be generalized for further categories.

Using group by and having should have the consistent performance characteristics that you want (although for simple queries is might be slower). The trick with this -- or any approach -- is what happens when there are multiple categories for a given patient.

For instance, this or your approach will find:

admission --> operation --> admission --> medication

I suspect that you don't really want to find these records. You probably need an intermediate level, representing some sort of "episode" for a given patient.

If that that is the case, you should ask another question with clearer examples both of the data, questions you might want to ask, and cases that match and do not match the conditions.

like image 2
Gordon Linoff Avatar answered Nov 14 '22 20:11

Gordon Linoff