Skip rows if subquery returns multiple rows in Postgres

Tags:

I want to update table prod_replay_out based on subquery results in Postgres. However, subquery returns multiple rows but I want to skip those rows and update table based on single rows return by subquery.

I have referred link Subquery returns more than 1 row error but max() function will not apply for my expected results. Could you please provide me some suggestion to modify query? Thank you.

prod_replay_out has following columns:

seller, buyer, sender_tag, seller_tag, buyer_tag, isin, quantity, in_msg_time, msg_type, cdsx_time

prod_replay_in has following columns:

seller, buyer, sender_tag, seller_tag, buyer_tag, isin, quantity, msg_type, cdsx_time

What I have tried?

Please find below update sql:

Update sql:

update prod_replay_out O  
   set in_msg_id = 
        (Select id
           From prod_replay_in I
          Where I.msg_type   = 'CDST010'
            and I.seller     = O.seller
            and I.buyer      = O.buyer
            and I.sender_tag = O.sender_tag
            and I.seller_tag = O.seller_tag
            and I.buyer_tag  = O.buyer_tag
            and I.isin       = O.isin
            and I.quantity   = O.quantity
            and I.cdsx_time  = O.in_msg_time
            and I.cdsx_time::text like '2020-05-12%'
         ) 
where O.msg_type = 'CDST01C'
and O.cdsx_time::text like '2020-05-12%';

I have tried below solution. Is it the correct approach or is there any loophole?

update prod_replay_out O  
   set in_msg_id = 
        (Select id
           From prod_replay_in I
          Where I.msg_type   = 'CDST010'
            and I.seller     = O.seller
            and I.buyer      = O.buyer
            and I.sender_tag = O.sender_tag
            and I.seller_tag = O.seller_tag
            and I.buyer_tag  = O.buyer_tag
            and I.isin       = O.isin
            and I.quantity   = O.quantity
            and I.cdsx_time  = O.in_msg_time
            and I.cdsx_time::text like '2020-05-12%'
            and 1 = (Select count(id)
                       From prod_replay_in I
                      Where I.msg_type   = 'CDST010'
                        and I.seller     = O.seller
                        and I.buyer      = O.buyer
                        and I.sender_tag = O.sender_tag
                        and I.seller_tag = O.seller_tag
                        and I.buyer_tag  = O.buyer_tag
                        and I.isin       = O.isin
                        and I.quantity   = O.quantity
                        and I.cdsx_time  = O.in_msg_time
                        and I.cdsx_time::text like '2020-05-12%'
                    )    
                )
where O.msg_type = 'CDST01C'
  and O.cdsx_time::text like '2020-05-12%';

619

asked Jun 22 '20 15:06

Jay Parmar

2 Answers

Query

Most importantly, don't use a correlated subquery. It's the inferior tool for the job. Use a subquery in the FROM clause.

This only updates where a single matching candidate row is found in the source table (neither none nor multiple), and only where it actually changes the value:

UPDATE prod_replay_out o
SET    in_msg_id = i.id
FROM  (
   SELECT i.id, i.seller, i.buyer, i.sender_tag, i.seller_tag, i.buyer_tag, i.isin, i.quantity, i.cdsx_time
   FROM   prod_replay_in i
   WHERE  i.msg_type   = 'CDST010'
   AND    i.cdsx_time >= '2020-05-12'     -- ① "sargable" expression
   AND    i.cdsx_time <  '2020-05-13'     -- ② don't cast to date, it's a valid timestamp literal
   AND    NOT EXISTS (                    -- ③ EXISTS is typically faster than counting
      SELECT FROM prod_replay_in x
      WHERE  x.id <> i.id                 -- ④ unique
      AND   (i.seller, i.buyer, i.sender_tag, i.seller_tag, i.buyer_tag, i.isin, i.quantity, i.cdsx_time)  -- ⑤ short syntax
        =   (x.seller, x.buyer, x.sender_tag, x.seller_tag, x.buyer_tag, x.isin, x.quantity, x.cdsx_time)
      )
   ) i
WHERE  o.msg_type = 'CDST01C'
AND   (i.seller, i.buyer, i.sender_tag, i.seller_tag, i.buyer_tag, i.isin, i.quantity, i.cdsx_time)
  =   (o.seller, o.buyer, o.sender_tag, o.seller_tag, o.buyer_tag, o.isin, o.quantity, o.in_msg_time)  -- ⑥ o.cdsx_time?
-- AND    o.cdsx_time >= '2020-05-12'     -- ⑦ redundant
-- AND    o.cdsx_time <  '2020-05-13'
AND   o.in_msg_id IS DISTINCT FROM i.id   -- ⑧ avoid empty updates
;

① Like GMB already suggested, transform this predicate to "sargable" expressions. This is faster, generally, and can use index support.

What does the word “SARGable” really mean?
Calculate number of concurrent events in SQL

② But don't cast to date if cdsx_time is a timestamp column (as seems likely). '2020-05-12' is a perfectly valid timestamp literal, signifying the first instance of the day. See:

Generating time series between two dates in PostgreSQL

If it's a timestamptz column, consider the possible influence of the timezone setting! See:

Ignoring time zones altogether in Rails and PostgreSQL

③ EXISTS is typically (much) more efficient than counting all rows, as it can stop as soon as another row is found. Especially if there can be many peers, and index support is available. See:

Select rows which are not present in other table

④ Assuming id is unique (or PK). Else use the system column ctid for the job. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

⑤ Convenient, equivalent short syntax with ROW values. See:

Enforcing index scan for multicolumn comparison

⑥ Your query has:

and I.cdsx_time  = O.in_msg_time         -- !?
and I.cdsx_time::text like '2020-05-12%'

... but:

O.cdsx_time::text like '2020-05-12%'

You didn't mean to write and I.cdsx_time = O.cdsx_time?

⑦ Would be noise. The restriction is already enforced in the subquery. (Doesn't help index support, either.)

⑧ This one is important if some columns may already have the desired value. Then the operation is skipped instead of writing an identical row version at full cost.

If both columns are defined NOT NULL, simplify to o.in_msg_id <> i.id. Again, see:

Update a column of a table with a column of another table in PostgreSQL

Indices

If performance is an issue or you run this repeatedly, consider indices like the following:

For the first (in order of the expected query plan!) step of identifying source row candidates:

CREATE INDEX foo ON prod_replay_in (msg_type, cdsx_time);

For the second step of ruling out duplicates:

CREATE INDEX foo ON prod_replay_in (seller, buyer, sender_tag, seller_tag, buyer_tag, isin, quantity, cdsx_time);

Or any small subset that is selective enough. A smaller index on fewer columns is typically more efficient if it includes relatively few additional rows as "false positives" in the index scan. While relatively few, these are eliminated cheaply in the following FILTER step.

For the final step of identifying target rows:

CREATE INDEX foo ON prod_replay_out (msg_type, in_msg_time);

Again: or any small subset that is selective enough.

answered Nov 02 '22 22:11

Erwin Brandstetter

You want to update only when the subquery returns one row. One option uses aggreation and having in the subquery:

update prod_replay_out o  
set in_msg_id = (
    select max(id)
    from prod_replay_in i
    where 
        i.msg_type       = 'cdst010'
        and i.seller     = o.seller
        and i.buyer      = o.buyer
        and i.sender_tag = o.sender_tag
        and i.seller_tag = o.seller_tag
        and i.buyer_tag  = o.buyer_tag
        and i.isin       = o.isin
        and i.quantity   = o.quantity
        and i.cdsx_time  = o.in_msg_time
        and i.cdsx_time  >= '2020-05-12'::date
        and i.cdsx_time  <  '2020-05-13'::date
    having count(*) = 1
) 
where 
    o.msg_type = 'cdst01c'
    and o.cdsx_time  >= '2020-05-12'::date
    and o.cdsx_time  <  '2020-05-13'::date

Note that I rewrote the date filters to avoid the conversion to text (you can use an half-open interval with date literals instead, which is by far more efficient).

Note that this updates in_msg_id to null when the subquery would have returned multiple rows (or no rows at all). If you want to avoid that, you can filter in the where clause:

update prod_replay_out o  
set in_msg_id = (
    select max(id)
    from prod_replay_in i
    where 
        i.msg_type       = 'cdst010'
        and i.seller     = o.seller
        and i.buyer      = o.buyer
        and i.sender_tag = o.sender_tag
        and i.seller_tag = o.seller_tag
        and i.buyer_tag  = o.buyer_tag
        and i.isin       = o.isin
        and i.quantity   = o.quantity
        and i.cdsx_time  = o.in_msg_time
        and i.cdsx_time  >= '2020-05-12'::date
        and i.cdsx_time  <  '2020-05-13'::date
    having count(*) = 1
) 
where 
    o.msg_type = 'cdst01c'
    and o.cdsx_time  >= '2020-05-12'::date
    and o.cdsx_time  <  '2020-05-13'::date
    and (
        select count(*)
        from prod_replay_in i
        where 
            i.msg_type       = 'cdst010'
            and i.seller     = o.seller
            and i.buyer      = o.buyer
            and i.sender_tag = o.sender_tag
            and i.seller_tag = o.seller_tag
            and i.buyer_tag  = o.buyer_tag
            and i.isin       = o.isin
            and i.quantity   = o.quantity
            and i.cdsx_time  = o.in_msg_time
            and i.cdsx_time  >= '2020-05-12'::date
            and i.cdsx_time  <  '2020-05-13'::date
    ) = 1

answered Nov 02 '22 23:11

GMB

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Skip rows if subquery returns multiple rows in Postgres

Tags:

sql

select

postgresql

sql-update

subquery

Jay Parmar

People also ask

2 Answers

Query

Indices

Erwin Brandstetter

GMB

Recent Activity

Donate For Us

Skip rows if subquery returns multiple rows in Postgres

Tags:

sql

select

postgresql

sql-update

subquery

Jay Parmar

People also ask

2 Answers

Query

Indices

Erwin Brandstetter

GMB

Related questions

Recent Activity

Donate For Us