Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Skip rows if subquery returns multiple rows in Postgres

I want to update table prod_replay_out based on subquery results in Postgres. However, subquery returns multiple rows but I want to skip those rows and update table based on single rows return by subquery.

I have referred link Subquery returns more than 1 row error but max() function will not apply for my expected results. Could you please provide me some suggestion to modify query? Thank you.

prod_replay_out has following columns:

seller, buyer, sender_tag, seller_tag, buyer_tag, isin, quantity, in_msg_time, msg_type, cdsx_time

prod_replay_in has following columns:

seller, buyer, sender_tag, seller_tag, buyer_tag, isin, quantity, msg_type, cdsx_time

What I have tried?

Please find below update sql:

Update sql:

update prod_replay_out O  
   set in_msg_id = 
        (Select id
           From prod_replay_in I
          Where I.msg_type   = 'CDST010'
            and I.seller     = O.seller
            and I.buyer      = O.buyer
            and I.sender_tag = O.sender_tag
            and I.seller_tag = O.seller_tag
            and I.buyer_tag  = O.buyer_tag
            and I.isin       = O.isin
            and I.quantity   = O.quantity
            and I.cdsx_time  = O.in_msg_time
            and I.cdsx_time::text like '2020-05-12%'
         ) 
where O.msg_type = 'CDST01C'
and O.cdsx_time::text like '2020-05-12%';

I have tried below solution. Is it the correct approach or is there any loophole?

update prod_replay_out O  
   set in_msg_id = 
        (Select id
           From prod_replay_in I
          Where I.msg_type   = 'CDST010'
            and I.seller     = O.seller
            and I.buyer      = O.buyer
            and I.sender_tag = O.sender_tag
            and I.seller_tag = O.seller_tag
            and I.buyer_tag  = O.buyer_tag
            and I.isin       = O.isin
            and I.quantity   = O.quantity
            and I.cdsx_time  = O.in_msg_time
            and I.cdsx_time::text like '2020-05-12%'
            and 1 = (Select count(id)
                       From prod_replay_in I
                      Where I.msg_type   = 'CDST010'
                        and I.seller     = O.seller
                        and I.buyer      = O.buyer
                        and I.sender_tag = O.sender_tag
                        and I.seller_tag = O.seller_tag
                        and I.buyer_tag  = O.buyer_tag
                        and I.isin       = O.isin
                        and I.quantity   = O.quantity
                        and I.cdsx_time  = O.in_msg_time
                        and I.cdsx_time::text like '2020-05-12%'
                    )    
                )
where O.msg_type = 'CDST01C'
  and O.cdsx_time::text like '2020-05-12%';
like image 619
Jay Parmar Avatar asked Jun 22 '20 15:06

Jay Parmar


People also ask

What if subquery returns multiple rows?

Multiple-row subqueries are nested queries that can return more than one row of results to the parent query. Multiple-row subqueries are used most commonly in WHERE and HAVING clauses. Since it returns multiple rows, it must be handled by set comparison operators (IN, ALL, ANY).

How do you fix more than one row returned by a subquery used as an expression?

Technically, to repair your statement, you can add LIMIT 1 to the subquery to ensure that at most 1 row is returned. That would remove the error, your code would still be nonsense.

How do you resolve ORA 01427 single row subquery returns more than one row?

There are essentially two options to take. The first is to rewrite the query so that the subquery will only return a single row, thus eliminating the source of the error. This will require a user to rethink how they wish to acquire the data that they were initially searching for.

What is the output of this query if the subquery returns no rows?

EXISTS. The argument of EXISTS is an arbitrary SELECT statement, or subquery. The subquery is evaluated to determine whether it returns any rows. If it returns at least one row, the result of EXISTS is “true”; if the subquery returns no rows, the result of EXISTS is “false”.


2 Answers

Query

Most importantly, don't use a correlated subquery. It's the inferior tool for the job. Use a subquery in the FROM clause.

This only updates where a single matching candidate row is found in the source table (neither none nor multiple), and only where it actually changes the value:

UPDATE prod_replay_out o
SET    in_msg_id = i.id
FROM  (
   SELECT i.id, i.seller, i.buyer, i.sender_tag, i.seller_tag, i.buyer_tag, i.isin, i.quantity, i.cdsx_time
   FROM   prod_replay_in i
   WHERE  i.msg_type   = 'CDST010'
   AND    i.cdsx_time >= '2020-05-12'     -- ① "sargable" expression
   AND    i.cdsx_time <  '2020-05-13'     -- ② don't cast to date, it's a valid timestamp literal
   AND    NOT EXISTS (                    -- ③ EXISTS is typically faster than counting
      SELECT FROM prod_replay_in x
      WHERE  x.id <> i.id                 -- ④ unique
      AND   (i.seller, i.buyer, i.sender_tag, i.seller_tag, i.buyer_tag, i.isin, i.quantity, i.cdsx_time)  -- ⑤ short syntax
        =   (x.seller, x.buyer, x.sender_tag, x.seller_tag, x.buyer_tag, x.isin, x.quantity, x.cdsx_time)
      )
   ) i
WHERE  o.msg_type = 'CDST01C'
AND   (i.seller, i.buyer, i.sender_tag, i.seller_tag, i.buyer_tag, i.isin, i.quantity, i.cdsx_time)
  =   (o.seller, o.buyer, o.sender_tag, o.seller_tag, o.buyer_tag, o.isin, o.quantity, o.in_msg_time)  -- ⑥ o.cdsx_time?
-- AND    o.cdsx_time >= '2020-05-12'     -- ⑦ redundant
-- AND    o.cdsx_time <  '2020-05-13'
AND   o.in_msg_id IS DISTINCT FROM i.id   -- ⑧ avoid empty updates
;

① Like GMB already suggested, transform this predicate to "sargable" expressions. This is faster, generally, and can use index support.

  • What does the word “SARGable” really mean?
  • Calculate number of concurrent events in SQL

② But don't cast to date if cdsx_time is a timestamp column (as seems likely). '2020-05-12' is a perfectly valid timestamp literal, signifying the first instance of the day. See:

  • Generating time series between two dates in PostgreSQL

If it's a timestamptz column, consider the possible influence of the timezone setting! See:

  • Ignoring time zones altogether in Rails and PostgreSQL

EXISTS is typically (much) more efficient than counting all rows, as it can stop as soon as another row is found. Especially if there can be many peers, and index support is available. See:

  • Select rows which are not present in other table

④ Assuming id is unique (or PK). Else use the system column ctid for the job. See:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

⑤ Convenient, equivalent short syntax with ROW values. See:

  • Enforcing index scan for multicolumn comparison

⑥ Your query has:

and I.cdsx_time  = O.in_msg_time         -- !?
and I.cdsx_time::text like '2020-05-12%'

... but:

O.cdsx_time::text like '2020-05-12%'

You didn't mean to write and I.cdsx_time = O.cdsx_time?

⑦ Would be noise. The restriction is already enforced in the subquery. (Doesn't help index support, either.)

⑧ This one is important if some columns may already have the desired value. Then the operation is skipped instead of writing an identical row version at full cost.

If both columns are defined NOT NULL, simplify to o.in_msg_id <> i.id. Again, see:

  • Update a column of a table with a column of another table in PostgreSQL

Indices

If performance is an issue or you run this repeatedly, consider indices like the following:

For the first (in order of the expected query plan!) step of identifying source row candidates:

CREATE INDEX foo ON prod_replay_in (msg_type, cdsx_time); 

For the second step of ruling out duplicates:

CREATE INDEX foo ON prod_replay_in (seller, buyer, sender_tag, seller_tag, buyer_tag, isin, quantity, cdsx_time);

Or any small subset that is selective enough. A smaller index on fewer columns is typically more efficient if it includes relatively few additional rows as "false positives" in the index scan. While relatively few, these are eliminated cheaply in the following FILTER step.

For the final step of identifying target rows:

CREATE INDEX foo ON prod_replay_out (msg_type, in_msg_time);

Again: or any small subset that is selective enough.

like image 64
Erwin Brandstetter Avatar answered Nov 02 '22 22:11

Erwin Brandstetter


You want to update only when the subquery returns one row. One option uses aggreation and having in the subquery:

update prod_replay_out o  
set in_msg_id = (
    select max(id)
    from prod_replay_in i
    where 
        i.msg_type       = 'cdst010'
        and i.seller     = o.seller
        and i.buyer      = o.buyer
        and i.sender_tag = o.sender_tag
        and i.seller_tag = o.seller_tag
        and i.buyer_tag  = o.buyer_tag
        and i.isin       = o.isin
        and i.quantity   = o.quantity
        and i.cdsx_time  = o.in_msg_time
        and i.cdsx_time  >= '2020-05-12'::date
        and i.cdsx_time  <  '2020-05-13'::date
    having count(*) = 1
) 
where 
    o.msg_type = 'cdst01c'
    and o.cdsx_time  >= '2020-05-12'::date
    and o.cdsx_time  <  '2020-05-13'::date
    

Note that I rewrote the date filters to avoid the conversion to text (you can use an half-open interval with date literals instead, which is by far more efficient).

Note that this updates in_msg_id to null when the subquery would have returned multiple rows (or no rows at all). If you want to avoid that, you can filter in the where clause:

update prod_replay_out o  
set in_msg_id = (
    select max(id)
    from prod_replay_in i
    where 
        i.msg_type       = 'cdst010'
        and i.seller     = o.seller
        and i.buyer      = o.buyer
        and i.sender_tag = o.sender_tag
        and i.seller_tag = o.seller_tag
        and i.buyer_tag  = o.buyer_tag
        and i.isin       = o.isin
        and i.quantity   = o.quantity
        and i.cdsx_time  = o.in_msg_time
        and i.cdsx_time  >= '2020-05-12'::date
        and i.cdsx_time  <  '2020-05-13'::date
    having count(*) = 1
) 
where 
    o.msg_type = 'cdst01c'
    and o.cdsx_time  >= '2020-05-12'::date
    and o.cdsx_time  <  '2020-05-13'::date
    and (
        select count(*)
        from prod_replay_in i
        where 
            i.msg_type       = 'cdst010'
            and i.seller     = o.seller
            and i.buyer      = o.buyer
            and i.sender_tag = o.sender_tag
            and i.seller_tag = o.seller_tag
            and i.buyer_tag  = o.buyer_tag
            and i.isin       = o.isin
            and i.quantity   = o.quantity
            and i.cdsx_time  = o.in_msg_time
            and i.cdsx_time  >= '2020-05-12'::date
            and i.cdsx_time  <  '2020-05-13'::date
    ) = 1
like image 34
GMB Avatar answered Nov 02 '22 23:11

GMB