Say I have a simplified model in which a patient
can have zero or more events
. An event has a category
and a date
. I want to support questions like:
Find all patients that were given a medication after an operation and
the operation happened after an admission.
Where medication, operation and admission are all types of event categories. There are ~100 possible categories.
I'm expecting 1000s of patients and every patient has ~10 events per category.
The naive solution I came up with was to have two tables, a patient
and an event
table. Create an index on event.category
and then query using inner-joins like:
SELECT COUNT(DISTINCT(patient.id)) FROM patient
INNER JOIN event AS medication
ON medication.patient_id = patient.id
AND medication.category = 'medication'
INNER JOIN event AS operation
ON operation.patient_id = patient.id
AND operation.category = 'operation'
INNER JOIN event AS admission
ON admission.patient_id = patient.id
AND admission.category = 'admission'
WHERE medication.date > operation.date
AND operation.date > admission.date;
However this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour:
| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2 | 100ms |
| 3 | 500ms |
| 4 | 2000ms |
| 5 | 8000ms |
Explain:
Does anyone have any suggestions on how to optimize this query/data model?
Extra info:
project_result
is equivalent to patient
in the simplified model.Advanced use case:
Find all patients that were given a medication within 30 days after an
operation and the operation happened within 7 days after an admission.
First, if referential integrity is enforced with FK constraints, you can drop the patient
table from the query completely:
SELECT COUNT(DISTINCT patient) -- still not optimal
FROM event a
JOIN event o USING (patient_id)
JOIN event m USING (patient_id)
WHERE a.category = 'admission'
AND o.category = 'operation'
AND m.category = 'medication'
AND m.date > o.date
AND o.date > a.date;
Next, get rid of the repeated multiplication of rows and the DISTINCT
to counter that in the outer SELECT
by using EXISTS
semi-joins instead:
SELECT COUNT(*)
FROM event a
WHERE EXISTS (
SELECT FROM event o
WHERE o.patient_id = a.patient_id
AND o.category = 'operation'
AND o.date > a.date
AND EXISTS (
SELECT FROM event m
WHERE m.patient_id = a.patient_id
AND m.category = 'medication'
AND m.date > o.date
)
)
AND a.category = 'admission';
Note, there can still be duplicates in the admission, but that's probably a principal problem in your data model / query design, and would need clarification as discussed in the comments.
If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):
SELECT count(*)
FROM patient p
CROSS JOIN LATERAL ( -- get earliest admission
SELECT e.date
FROM event e
WHERE e.patient_id = p.id
AND e.category = 'admission'
ORDER BY e.date
LIMIT 1
) a
CROSS JOIN LATERAL ( -- get earliest operation after that
SELECT e.date
FROM event e
WHERE e.patient_id = p.id
AND e.category = 'operation'
AND e.date > a.date
ORDER BY e.date
LIMIT 1
) o
WHERE EXISTS ( -- the *last* step can still be a plain EXISTS
SELECT FROM event m
WHERE m.patient_id = p.id
AND m.category = 'medication'
AND m.date > o.date
);
See:
You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer
(or even int2
or "char"
value as FK.)
For best performance (and this is crucial) have a multicolumn index on (parent_id, category, date DESC)
and make sure all three columns are defined NOT NULL
. The order of index expressions is important. DESC
is mostly optional here. Postgres can use the index with default ASC
sort order almost as efficiently in your case.
If VACUUM
(preferably in the form of autovacuum) can keep up with write operations or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.
Related:
To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.
You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id
of the referenced event!) Dates / timestamps alone are error-prone.
SELECT COUNT(*) -- to count cases
-- COUNT(DISTINCT patient_id) -- to count patients
FROM event a
WHERE EXISTS (
SELECT FROM event o
WHERE o.patient_id = a.patient_id
AND o.category = 'operation'
AND o.date >= a.date -- or ">"
AND o.date < a.date + 7 -- based on data type "date"!
AND EXISTS (
SELECT FROM event m
WHERE m.patient_id = a.patient_id
AND m.category = 'medication'
AND m.date >= o.date -- or ">"
AND m.date < o.date + 30 -- syntax for timestamp is different
)
)
AND a.category = 'admission';
About date
/ timestamp
arithmetic:
You might find that conditional aggregation does what you want. The time component can be difficult to handle (see below), if your sequences get complicated, but the basic idea:
select e.patient_id
from events e
group by e.patient_id
having (max(date) filter (where e.category = 'medication') >
min(e.date) filter (where e.category = 'operation')
) and
(min(date) filter (where e.category = 'operation') >
min(e.date) filter (where e.category = 'admission'
);
This can be generalized for further categories.
Using group by
and having
should have the consistent performance characteristics that you want (although for simple queries is might be slower). The trick with this -- or any approach -- is what happens when there are multiple categories for a given patient.
For instance, this or your approach will find:
admission --> operation --> admission --> medication
I suspect that you don't really want to find these records. You probably need an intermediate level, representing some sort of "episode" for a given patient.
If that that is the case, you should ask another question with clearer examples both of the data, questions you might want to ask, and cases that match and do not match the conditions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With