Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get identity of row inserted in Snowflake Datawarehouse

If I have a table with an auto-incrementing ID column, I'd like to be able to insert a row into that table, and get the ID of the row I just created. I know that generally, StackOverflow questions need some sort of code that was attempted or research effort, but I'm not sure where to begin with Snowflake. I've dug through their documentation and I've found nothing for this.

The best I could do so far is try result_scan() and last_query_id(), but these don't give me any relevant information about the row that was inserted, just confirmation that a row was inserted.

I believe what I'm asking for is along the lines of MS SQL Server's SCOPE_IDENTITY() function.

Is there a Snowflake equivalent function for MS SQL Server's SCOPE_IDENTITY()?

EDIT: for the sake of having code in here:

CREATE TABLE my_db..my_table
(
    ROWID INT IDENTITY(1,1),
    some_number INT,
    a_time TIMESTAMP_LTZ(9),
    b_time TIMESTAMP_LTZ(9),
    more_data VARCHAR(10)
);
INSERT INTO my_db..my_table
(
    some_number,
    a_time,
    more_data
)
VALUES
(1, my_time_value, some_data);

I want to get to that auto-increment ROWID for this row I just inserted.

like image 201
Joshua Schlichting Avatar asked Dec 18 '18 17:12

Joshua Schlichting


People also ask

How do you get Rowid in a Snowflake?

The Oracle ROWID is the physical location of a row. This concept does not exist in the Snowflake database, so there is not a need to generate it. However, if you want to migrate data from Oracle and keep the ROWID column, darren's suggestion above should work, load it into a VARCHAR datatype.

Does Snowflake have identity column?

Snowflake utilizes an internal sequence to generate the values for the identity column, and it does not guarantee to generate gap-free sequence numbers.

What is identity column in Snowflake?

In Snowflake, you can set the default value for a column, which is typically used to set an autoincrement or identity as the default value, so that each time a new row is inserted a unique id for that row is generated and stored and can be used as a primary key.


1 Answers

NOTE: The answer below can be not 100% correct in some very rare cases, see the UPDATE section below

Original answer

Snowflake does not provide the equivalent of SCOPE_IDENTITY today.

However, you can exploit Snowflake's time travel to retrieve the maximum value of a column right after a given statement is executed.

Here's an example:

create or replace table x(rid int identity, num int);
insert into x(num) values(7);
insert into x(num) values(9);
-- you can insert rows in a separate transaction now to test it
select max(rid) from x AT(statement=>last_query_id());
----------+
 MAX(RID) |
----------+
 2        |
----------+

You can also save the last_query_id() into a variable if you want to access it later, e.g.

insert into x(num) values(5);
set qid = last_query_id();
...
select max(rid) from x AT(statement=>$qid);

Note - it will be usually correct, but if the user e.g. inserts a large value into rid manually, it might influence the result of this query.

UPDATE

Note, I realized the code above might rarely generate incorrect answer.

Since the execution order of various phases of a query in a distributed system like Snowflake can be non-deterministic, and Snowflake allows concurrent INSERT statements, the following might happen

  • Two queries, Q1 and Q2, do a simple single row INSERT, start at roughly the same time
  • Q1 starts, is a bit ahead
  • Q2 starts
  • Q1 creates a row with value 1 from the IDENTITY column
  • Q2 creates a row with value 2 from the IDENTITY column
  • Q2 gets ahead of Q1 - this is the key part
  • Q2 commits, is marked as finished at time T2
  • Q1 commits, is marked as finished at time T1

Note that T1 is later than T2. Now, when we try to do SELECT ... AT(statement=>Q1), we will see the state as-of T1, including all changes from statements before, hence including the value 2 from Q2. Which is not what we want.

The way around it could be to add a unique identifier to each INSERT (e.g. from a separate SEQUENCE object), and then use a MAX.

Sorry. Distributed transactions are hard :)

like image 85
Marcin Zukowski Avatar answered Nov 16 '22 12:11

Marcin Zukowski