Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting Recent Rows, Optimization (Oracle SQL)

I would appreciate some guidance on the following query. We have a list of experiments and their current progress state (for simplicity, I've reduced the statuses to 4types, but we have 10 different statuses in our data). I need to eventually return a list of the current status of all non-finished experiments.

Given a table exp_status,

Experiment | ID     | Status
----------------------------
     A     |   1    | Starting 
     A     |   2    | Working On It
     B     |   3    | Starting
     B     |   4    | Working On It
     B     |   5    | Finished Type I
     C     |   6    | Starting
     D     |   7    | Starting
     D     |   8    | Working On It
     D     |   9    | Finished Type II
     E     |   10   | Starting
     E     |   11   | Working On It
     F     |   12   | Starting 
     G     |   13   | Starting
     H     |   14   | Starting
     H     |   15   | Working On It
     H     |   16   | Finished Type II

Desired Result Set:

  Experiment | ID   | Status
----------------------------
     A     |   2    | Working On It
     C     |   6    | Starting
     E     |   11   | Working On It
     F     |   12   | Starting 
     G     |   13   | Starting

The most recent ID number will correspond to the most recent status.

Now, the current code I have executes in 150 seconds.

    SELECT *
    FROM 
          (SELECT Experiment, ID, Status, 
          row_number () over (partition by Experiment
          order by ID desc) as rn
          FROM exp_status)
    WHERE rn = 1
    AND status NOT LIKE ('Finished%')

The thing is, this code wastes its time. The result set is 45 thousand rows pulled from a table of 3.9 million. This is because most experiments are in the finished status. The code goes through and orders all of them then only filters out the finished at the end. About 95% of the experiments in the table are in the finished phase. I could not figure out how to make the query first pick out all the experiments and statuses where there isn't a 'Finished' for that experiment. I tried the following but had very slow performance.

SELECT *
FROM exp_status
WHERE experiment NOT IN 
(
  SELECT experiment
  FROM exp_status
  WHERE status LIKE ('Finished%')
)

Any help would be appreciated!

like image 560
User Avatar asked Mar 05 '26 00:03

User


1 Answers

Given your requirement, I think your current query with with row_number() is one of the most efficient possible. This query takes time not because it has to sort the data, but because there is so much data to read in the first place (the extra cpu time is negligible compared to the fetch time). Furthermore, the first query makes a FULL SCAN that is really the best way to read lots of data.

You need to find a way to read a lot less rows if you want to improve performance. The second query doesn't go in the right direction:

  1. the inner query will likely be a full scan since the 'finished' rows will be spread across the whole table and likely represent a big percentage of all rows.
  2. the outer query will also likey be a full scan and a nice ANTI-HASH JOIN which should be quicker than 45k * (number of status change per experiment) non-unique index scans.

So the second query seems to have at least twice the number of reads (plus a join).

If you want to really improve performance, I think you will need a change of design.

You could for instance build a table of active experiments and join to this table. You would maintain this table either as a materialized view or with a modification to the code that inserts experiment statuses. You could go further and store the last status in this table. Maintaining this "last status" will likely be an extra burden but this could be justified by the improved performance.

like image 77
Vincent Malgrat Avatar answered Mar 07 '26 17:03

Vincent Malgrat



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!