Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to Move or Alter a Pipe without missing or duplicating any records

Tags:

This page on managing pipes suggests a process for altering the copy into statement in a pipe.

  1. Pause the pipe (using ALTER PIPE … SET PIPE_EXECUTION_PAUSED=true).
  2. Query the SYSTEM$PIPE_STATUS function and verify that the pipe execution state is PAUSED and the pending file count is 0.
  3. Recreate the pipe to change the COPY statement in the definition. Choose either of the following options: Drop the pipe (using DROP PIPE) and create it (using CREATE PIPE). Recreate the pipe (using the CREATE OR REPLACE PIPE syntax). Internally, the pipe is dropped and created.
  4. Query the SYSTEM$PIPE_STATUS function again and verify that the pipe execution state is RUNNING.

However, if a file should be loaded during the time between pausing and recreating the pipe, there's no step here to refresh for that gap. Even if these steps happen quickly, we've had examples of files getting missed.

Running an ALTER PIPE REFRESH though, causes duplicates because copy history is tied to the pipe. The recreated pipe doesn't have this history and will go back and reload everything.

Is there a nice way to script a change like this out to guarantee that there's no gaps or overlap? Something like getting the timestamp for when the original pipe was paused, and then using that timestamp in a refresh query?

Update: We built a full process and combination of scripts to handle our scenarios. Full script Included in answer below.

like image 408
David Garrison Avatar asked Mar 17 '20 18:03

David Garrison


1 Answers

Unfortunately, this is a current limitation of auto ingest snowpipe, and there is no great workaround.

You are absolutely right about history being tied to a specific pipe. When the pipe is recreated, technically you have a new pipe with new history. So, ALTER PIPE REFRESH will reload everything under the specified prefix again, resulting in duplicate data.

You could use the “MODIFIED_DATE” option of ALTER PIPE REFRESH and specify a time that is equal to the last pipe_recieve_time, but you might miss some notifications, as there is no incoming order guaranteed.

To minimize the amount of duplicate data, you could organize your data files in more granular date time structure (ex. year/month/day/hour/minute/...) and refresh only the folders corespondent to the time where your pipe was “down”. However, there could be some files that were loaded already by the previous pipe, resulting in some duplicate data.

like image 70
Gjorgji Avatar answered Oct 02 '22 15:10

Gjorgji