Correct use of cursors for very large result sets in Postgres

Short version of my question:

If I hold a cursor reference to an astronomically huge result set in my client code, would it be ridiculous (i.e. would it completely defeat the point of cursors) to issue "FETCH ALL FROM cursorname" as my next command? Or would this slowly stream the data back to me as I consume it (at least in principle, assuming a well-written driver sitting between me and Postgres)?

More detail

If I understand things at all correctly, then Postgres cursors are REALLY for dealing with the following problem [even though they can be used (abused?) for other things, such as returning multiple different result sets from one function]:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated.

(ref: https://www.postgresql.org/docs/9.6/static/plpgsql-control-structures.html)
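
To make that concrete, I take the note to be talking about functions of roughly this shape (a minimal sketch only; the function name, and the assumption of an integer id column in a table called large, are just for illustration):

    CREATE OR REPLACE FUNCTION all_rows() RETURNS SETOF integer
       LANGUAGE plpgsql AS
    $$BEGIN
       -- the whole result set is materialized before the function returns
       RETURN QUERY SELECT id FROM large;
    END;$$;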

But (again if I understand correctly) when you write a function which returns a cursor, the whole result set is NOT buffered into memory (or onto disk) before the user of the function can start to consume anything; instead the results can be consumed bit by bit. (There is more overhead in setting up and using the cursor, but it's worth it to avoid massive buffer allocation for very large result sets.)

(ref: https://www.postgresql.org/docs/9.6/static/plpgsql-cursors.html#AEN66551)
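
As I understand it, at the plain SQL level that cursor-based, bit-by-bit consumption looks roughly like this (again only a sketch, with a made-up table name), with the client choosing how many rows to pull at a time:

    BEGIN;
    DECLARE big_cur CURSOR FOR SELECT * FROM astronomically_large_table;
    FETCH 1000 FROM big_cur;   -- rows come back in client-chosen batches
    FETCH 1000 FROM big_cur;   -- ...repeat until no rows are returned
    CLOSE big_cur;
    COMMIT;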

I would like to understand how this relates to SELECTs and FETCHes over the wire to a Postgres server.

In all cases, I'm talking about consuming results from client code which is communicating with Postgres on a socket behind the scenes (using the Npgsql library in my case, actually).

Q1: What if I try to execute "SELECT * FROM AstronomicallyLargeTable" as my only command over the wire to Postgres? Will that allocate all the memory for the entire select and then start to send data back to me? Or will it (effectively) generate its own cursor and stream the data back a little at a time (with no huge additional buffer allocation on the server)?

Q2: What if I already have a cursor reference to an astronomically large result set (say because I've already done one round trip, and got back the cursor reference from some function), and then I execute "FETCH ALL FROM cursorname" over the wire to Postgres? Is that stupid, because it will allocate ALL the memory for all the results on the Postgres server before sending anything back to me? Or will "FETCH ALL FROM cursorname" actually work as I'd like it to, streaming the data back slowly as I consume it, without any massive buffer allocation happening on the Postgres server?

EDIT: Further clarification

I'm asking about a case where I know that my data access layer streams the data from the server to me one row at a time (so no large client-side buffers involved there, however long the data streams for) and where I also know that my own application consumes the data one row at a time and then discards it (so no client-side buffers there, either). I definitely DON'T want to fetch all these rows into client side memory, and then do something with them. I see that that would be completely daft!

So I think all the issues (for the use case just described) come down to how long PostgreSQL would take to start streaming, and how much of a memory buffer it would allocate, for a FETCH ALL. IF (and it's a big 'IF'...) PostgreSQL doesn't buffer all the rows before starting, and if it streams the rows back to Npgsql one at a time, starting quickly, then I believe (but please tell me why/if I'm wrong) that there still IS a clear use case for FETCH ALL FROM cursorname!

asked Feb 17 '17 by MikeBeaton


1 Answer

After some experimenting, it seems that PostgreSQL behaves like this:

  • Fetching many rows with SELECT * FROM large will not create a temporary file on the server side; the data are streamed as they are scanned.

  • If you create a server-side cursor with a function that returns a refcursor and then fetch rows from that cursor with FETCH ALL, all of the returned rows are collected on the server first, which leads to the creation of a temporary file.

Here are my experiments with a table that contains 1,000,000 rows. work_mem is set to 64kB (the minimum) and log_temp_files is set to 0 so that temporary files are reported in the server log.
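
The setup boils down to something like this (a sketch; the settings could equally be made in postgresql.conf, and log_temp_files normally requires superuser privileges to change):

    SET work_mem = '64kB';       -- the smallest value PostgreSQL accepts
    SET log_temp_files = 0;      -- log every temporary file, whatever its size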

  • First attempt:

    SELECT id FROM large;
    

    Result: no temporary file is created.

  • Second attempt:

    CREATE OR REPLACE FUNCTION lump() RETURNS refcursor
       LANGUAGE plpgsql AS
    $$DECLARE
       c CURSOR FOR SELECT id FROM large;
    BEGIN
       c := 'c';   -- set the portal name explicitly (this is also the default name for a bound cursor variable)
       OPEN c;
       RETURN c;
    END;$$;
    
    BEGIN;
    SELECT lump();
     lump
    ------
     c
    (1 row)
    
    FETCH NEXT FROM c;
     id
    ----
      1
    (1 row)
    
    FETCH NEXT FROM c;
     id
    ----
      2
    (1 row)
    
    COMMIT;
    

    Result: no temporary file is created.

  • Third attempt:

    BEGIN;
    SELECT lump();
     lump
    ------
     c
    (1 row)
    
    FETCH all FROM c;
       id
    ---------
           1
           2
           3
    ...
      999999
     1000000
    (1000000 rows)
    
    COMMIT;
    

    Result: a temporary file of about 140MB is created.

I don't really know why PostgreSQL behaves that way.
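
If the FETCH ALL behaviour is a problem, the obvious thing to try (not tested above, so treat this as a sketch extrapolated from the FETCH NEXT result) is to fetch from the returned cursor in batches rather than all at once:

    BEGIN;
    SELECT lump();              -- returns the refcursor named 'c'
    FETCH 10000 FROM c;         -- repeat until a FETCH returns no rows
    FETCH 10000 FROM c;
    COMMIT;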

answered by Laurenz Albe