
How to pass a set of rows from one function into another?

Tags: sql, postgresql

Overview

I'm using PostgreSQL 9.1.14, and I'm trying to pass the results of a function into another function. The general idea (specifics, with a minimal example, follow) is that we can write:

select * from (select * from foo ...) 

and we can abstract the sub-select away in a function and select from it:

create function foos() 
returns setof foo
language sql as $$
  select * from foo ...
$$;

select * from foos()

Is there some way to abstract one level further, so as to be able to do something like this (I know functions cannot actually have arguments with setof types):

create function more_foos( some_foos setof foo )
returns setof foo
language sql as $$
  select * from some_foos ...  -- or unnest(some_foos), or ???
$$;

select * from more_foos(foos())

Minimal Example and Attempted Workarounds

I'm using PostgreSQL 9.1.14. Here's a minimal example:

-- 1. create a table x with three rows                                                                                                                                                            
drop table if exists x cascade;
create table if not exists x (id int, name text);
insert into x values (1,'a'), (2,'b'), (3,'c');

-- 2. xs() is a function with type `setof x`
create or replace function xs()
returns setof x
language sql as $$
  select * from x
$$;

-- 3. xxs() should return the contents of x, too
--    Ideally the argument would be a `setof x`,
--    but that's not allowed (see below).
create or replace function xxs(x[])  
returns setof x
language sql as $$
  select x.* from x
  join unnest($1) y
       on x.id = y.id
$$;

When I load up this code, I get the expected output for the table definitions, and I can call and select from xs() as I'd expect. But when I try to pass the result of xs() to xxs(), I get an error that "function xxs(x) does not exist":

db=> \i test.sql 
DROP TABLE
CREATE TABLE
INSERT 0 3
CREATE FUNCTION
CREATE FUNCTION

db=> select * from xs();
 id | name
----+------
  1 | a
  2 | b
  3 | c
(3 rows)

db=> select * from xxs(xs());
ERROR:  function xxs(x) does not exist
LINE 1: select * from xxs(xs());
                      ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

I'm a bit confused by "function xxs(x) does not exist": since xs() is declared to return setof x, I'd have expected the error to refer to setof x (or maybe x[]), not x. Following the complaints about the type, I can get to either of the following definitions, but while either one lets me run select xxs(xs());, neither lets me run select * from xxs(xs());.

create or replace function xxs( x )
returns setof x
language sql as $$
  select x.* from x
  join unnest(array[$1]) y    -- unnest(array[...]) seems pretty bad
       on x.id = y.id
$$;

create or replace function xxs( x )
returns setof x
language sql as $$
  select * from x
         where x.id in ($1.id)
$$;
db=> select xxs(xs());
  xxs
-------
 (1,a)
 (2,b)
 (3,c)
(3 rows)

db=> select * from xxs(xs());
ERROR:  set-valued function called in context that cannot accept a set
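
As an aside, the x[] version from earlier can at least be invoked by assembling the array at the call site; this appears to type-check on 9.1 (the select xs() form relies on calling a set-returning function in the select list), though it's exactly the kind of workaround I'm asking about:

select * from xxs( array(select xs()) );
-- or, aggregating explicitly:
select * from xxs( (select array_agg(x) from xs() x) );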

Summary

What's the right way to pass the results of a set-returning function into another function? (I have noted that create function … xxs( setof x ) … fails with ERROR: functions cannot accept set arguments, so the answer won't literally be passing a set of rows from one function to another.)

Asked Nov 14 '14 by Joshua Taylor

1 Answer

Table functions

I perform very high speed, complex database migrations for a living, using SQL as both the client and server language (no other language is used), all running server side, where the code rarely surfaces from the database engine. Table functions play a HUGE role in my work. I don't use cursors, since they are too slow to meet my performance requirements, and everything I do is result set oriented. Table functions have helped me eliminate cursors entirely, achieve very high speed, and dramatically reduce code volume while improving simplicity.

In short, you use a query that references two (or more) table functions to pass the data from one table function to the next. The result set of the select query that calls the table functions serves as the conduit between them. On the DB2 platform/version I work on (and, from a quick look at the 9.1 Postgres manual, the same appears true there), you can only pass a single row of column values as input to any table function call, as you've discovered. However, because the table function call happens in the middle of the query's result set processing, you achieve the effect of passing a whole result set to each table function, albeit, in the database engine plumbing, the data is passed only one row at a time to each function.

Table functions accept one row of input columns and return a single result set back into the calling query (i.e. the select) that invoked them. The result set columns passed back from a table function become part of the calling query's result set, and are therefore available as input to the next table function referenced later in the same query, typically in a subsequent join. The first table function's result columns are fed as input (one row at a time) to the second table function, which returns its result set columns into the calling query's result set. Both the first and second table functions' result columns are now part of the calling query's result set, and available as input (one row at a time) to a third table function. Each table function call widens the calling query's result set via the columns it returns. This can go on and on until you start hitting limits on the width of a result set, which likely vary from one database engine to the next.
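
Here is a minimal sketch of that chaining in PostgreSQL terms. Hedged: it uses LATERAL, which arrived in PostgreSQL 9.3 (later than the 9.1 in the question), and step_one and step_two are hypothetical placeholder functions, not anything from the question:

-- Hypothetical set-returning functions, purely for illustration.
create function step_one() returns table (id int, a text)
language sql as $$ values (1, 'a'), (2, 'b') $$;

create function step_two(int) returns table (b text)
language sql as $$ select 'got ' || $1::text $$;

-- The calling query is the conduit: step_one's rows are fed,
-- one row at a time, into step_two, whose output columns widen
-- the calling query's result set.
select s1.id, s1.a, s2.b
from step_one() s1
cross join lateral step_two(s1.id) s2;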

Consider this example (which may not match Postgres's syntax requirements or capabilities, as I work on DB2). It is one of many design patterns in which I use table functions, one of the simpler ones, and I think it is very illustrative; I anticipate it would have broad appeal if table functions were in heavy mainstream use (to my knowledge they are not, but I think they deserve more attention than they get).

In this example, the table functions in use are: VALIDATE_TODAYS_ORDER_BATCH, POST_TODAYS_ORDER_BATCH, and DATA_WAREHOUSE_TODAYS_ORDER_BATCH. On the DB2 version I work on, you wrap the table function inside "TABLE( place table function call and parameters here )", but based on a quick look at a Postgres manual it appears you omit the "TABLE( )" wrapper.
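
Schematically, the difference looks like this (fn is a placeholder name, not a function from the example):

-- DB2: the table function call is wrapped in TABLE()
select t.* from table( fn(42) ) as t;

-- PostgreSQL: the function is referenced directly in FROM
select t.* from fn(42) as t;

With that noted, here is the example: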

create table TODAYS_ORDER_PROCESSING_EXCEPTIONS as (

select      TODAYS_ORDER_BATCH.*
           ,VALIDATION_RESULT.ROW_VALID
           ,POST_RESULT.ROW_POSTED
           ,WAREHOUSE_RESULT.ROW_WAREHOUSED

from        TODAYS_ORDER_BATCH

cross join  VALIDATE_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function]  ) 
              as VALIDATION_RESULT ( ROW_VALID )  --example: 1/0 true/false Boolean returned

left join   POST_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] )
              as POST_RESULT ( ROW_POSTED )  --example: 1/0 true/false Boolean returned
      on    ROW_VALID = '1'

left join   DATA_WAREHOUSE_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] )
              as WAREHOUSE_RESULT ( ROW_WAREHOUSED )  --example: 1/0 true/false Boolean returned
      on    ROW_POSTED = '1'

where       coalesce( ROW_VALID,      '0' ) = '0'   --Capture only exceptions and unprocessed work.  
      or    coalesce( ROW_POSTED,     '0' ) = '0'   --Or, you can flip the logic to capture only successful rows.
      or    coalesce( ROW_WAREHOUSED, '0' ) = '0'

) with data

  1. If table TODAYS_ORDER_BATCH contains 1,000,000 rows, then VALIDATE_TODAYS_ORDER_BATCH will be called 1,000,000 times, once for each row.
  2. If 900,000 rows pass validation inside VALIDATE_TODAYS_ORDER_BATCH, then POST_TODAYS_ORDER_BATCH will be called 900,000 times.
  3. If only 850,000 rows successfully post, then VALIDATE_TODAYS_ORDER_BATCH needs some loopholes closed LOL, and DATA_WAREHOUSE_TODAYS_ORDER_BATCH will be called 850,000 times.
  4. If 850,000 rows successfully made it into the Data Warehouse (i.e. no additional exceptions were generated), then table TODAYS_ORDER_PROCESSING_EXCEPTIONS will be populated with 1,000,000 - 850,000 = 150,000 exception rows.
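
For what it's worth, a rough PostgreSQL rendering of the pattern above might look like the sketch below. Hedged assumptions: it requires PostgreSQL 9.3+ for LATERAL; the three function definitions are hypothetical (each is assumed to take the order number and return boolean flags rather than '1'/'0' characters); and the validation function is shown returning a second column, anticipating the multi-column point made just below:

-- Sketch only: function signatures are assumed, not given in the original.
create table todays_order_processing_exceptions as
select  b.*,
        v.row_valid,
        v.reason,            -- e.g. why validation failed
        p.row_posted,
        w.row_warehoused
from    todays_order_batch b
cross join lateral validate_todays_order_batch(b.order_number) as v
left join lateral post_todays_order_batch(b.order_number) as p
        on v.row_valid
left join lateral data_warehouse_todays_order_batch(b.order_number) as w
        on p.row_posted
where   not coalesce(v.row_valid,      false)   -- capture only exceptions
   or   not coalesce(p.row_posted,     false)   -- and unprocessed work
   or   not coalesce(w.row_warehoused, false);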

The table function calls in this example are only returning a single column, but they could be returning many columns. For example, the table function validating an order row could return the reason why an order failed validation.

In this design, virtually all the chatter between a HLL (high-level language) and the database is eliminated, since the HLL requestor asks the database to process the whole batch in ONE request. This eliminates millions of SQL requests to the database and millions of HLL procedure or method calls, and as a result provides a HUGE runtime improvement. In contrast, legacy code, which often processes a single row at a time, would typically send 1,000,000 fetch SQL requests, one for each row in TODAYS_ORDER_BATCH, plus at least 1,000,000 HLL and/or SQL requests for validation purposes, plus at least 1,000,000 HLL and/or SQL requests for posting purposes, plus 1,000,000 HLL and/or SQL requests for sending the order to the data warehouse. Granted, using this table function design, SQL requests are still issued from inside the table functions, but when the database makes requests to itself (i.e. from inside a table function), the SQL requests are serviced much faster (especially in comparison to a legacy scenario where the HLL requestor is doing single row processing from a remote system, with the worst case over a WAN - OMG please don't do that).

You can easily run into performance problems if you use a table function to "fetch a result set" and then join that result set to other tables. In that case, the SQL optimizer can't predict what set of rows will be returned from the table function, and therefore it can't optimize the join to subsequent tables. For that reason, I rarely use them for fetching a result set, unless I know that result set will be a very small number of rows, hence not causing a performance problem, or I don't need to join to subsequent tables.
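
In PostgreSQL specifically, one partial mitigation is to declare planner estimates on the function itself via the COST and ROWS clauses. A sketch using the question's xs() function; the numbers are illustrative, not recommendations:

-- ROWS tells the planner how many rows to expect from a
-- set-returning function; COST is the estimated per-call cost.
create or replace function xs()
returns setof x
language sql
cost 100
rows 3
as $$
  select * from x
$$;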

In my opinion, one reason why table functions are underutilized is that they are often perceived as only a tool to fetch a result set, which often performs poorly, so they get written off as a "poor" tool to use.

Table functions are immensely useful for pushing more functionality over to the server, for eliminating most of the chatter between the database server and programs on remote systems, and even for eliminating chatter between the database server and external programs on the same server. Even chatter between programs on the same server carries more overhead than many people realize, and much of it is unnecessary. The heart of the power of table functions lies in using them to perform actions inside result set processing.

There are more advanced design patterns for using table functions that build on the above pattern, where you can maximize result set processing even further, but this post is a lot for most to absorb already.

Answered Oct 06 '22 by Mike Jones