
Postgres bulk INSERT function using JSON arguments

Here's a plpgsql function for Postgres 9.6. It tries to INSERT a row, and if the row is actually inserted (i.e. not skipped due to a key constraint violation), it runs a few more commands.

CREATE FUNCTION foo(int, text, text)
  RETURNS void AS
$$
BEGIN
  INSERT INTO table1 (id, val1, val2) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING;
  IF FOUND THEN
    INSERT INTO table2 (table1_id, val1) VALUES ($1, $2);
    UPDATE table3 SET (val2, time) = ($3, now()) WHERE table1_id = $1;
  END IF;
END;
$$ LANGUAGE plpgsql;

This function processes a single record, but how could you modify it to process a batch of thousands of records?

I found an answer, which suggests making each of the 3 function arguments an array. But is there a way to do it where the arguments I pass in more closely represent how the records look in my application?

For example, the ideal solution would be for my application code to call select foo($1), where the parameter $1 is a JSON array of objects and each inner object is a record to be inserted:

[ 
  { "id": "1", "val1": "1-val1", "val2": "1-val2" },
  { "id": "2", "val1": "2-val1", "val2": "2-val2" },
  { "id": "3", "val1": "3-val1", "val2": "3-val2" },
  { "id": "4", "val1": "4-val1", "val2": "4-val2" }
]

The second-best option would be for my application code to call select foo($1, $2, $3, $4), where each parameter is a JSON object corresponding to a record to be inserted:

{ "id": "1", "val1": "1-val1", "val2": "1-val2" }  // This would be $1
{ "id": "2", "val1": "2-val1", "val2": "2-val2" }  // This would be $2

I'm looking at the various JSON functions offered by Postgres here, and they seem relevant, but I can't figure out exactly which to use. Is what I'm trying to do even possible? Would using JSON arrays instead of JSON objects anywhere make it possible?

user779159


1 Answer

For thousands of records

1. Create a temporary table of input rows, composed of your values $1, $2, $3. The fastest way to upload is COPY, or the \copy meta-command of psql if the data is not on the same machine as the database server. Let's suppose this table:

CREATE TEMP TABLE tmp(id int PRIMARY KEY, val1 text, val2 text);

I added a PK constraint, which is totally optional, but it ensures we are dealing with unique, non-null int values. If you can vouch for the input data, you don't need the constraint.
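For example, if the input rows are in a CSV file on the client machine (data.csv is just a placeholder name), the psql load could look like this:

\copy tmp FROM 'data.csv' WITH (FORMAT csv)

\copy reads the file client-side and streams it to the server; a plain COPY tmp FROM '/path/to/data.csv' WITH (FORMAT csv) does the same server-side.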

2. Chain your commands with data-modifying CTEs. As we have determined under your previous question, there are no race conditions to take care of in this particular operation.

WITH ins1 AS (
   INSERT INTO table1 AS t1 (id, val1, val2)
   SELECT id, val1, val2 FROM tmp ON CONFLICT DO NOTHING
   RETURNING t1.id, t1.val1, t1.val2  -- only actually inserted rows returned
   )
, ins2 AS (
   INSERT INTO table2 (table1_id, val1)
   SELECT id, val1 FROM ins1
   )
UPDATE table3 t3
SET    val2 = i.val2
     , time = now()
FROM   ins1 i
WHERE  t3.table1_id = i.id;

Steps 1 and 2 must run in the same session (not necessarily the same transaction), since the scope of temporary tables is bound to the session.

Note that the UPDATE only depends on the 1st INSERT; success of the 2nd INSERT is guaranteed, since it has no ON CONFLICT DO NOTHING and the whole operation is rolled back if there is any conflict in the 2nd INSERT.

Related:

  • COPY command: copy only specific columns from csv

For just a couple of records

There are various options. Your idea to pass a JSON array to a function is one of them. If the objects match the target table, you can use json_populate_recordset() in a single INSERT query, or just use the INSERT (as a prepared statement) without a function wrapper.

INSERT INTO target_tbl  -- it's ok to omit target columns here
SELECT *
FROM   json_populate_recordset(null::target_tbl,  -- use same table type
          json '[{ "id": "1", "val1": "1-val1", "val2": "1-val2" },
                 { "id": "2", "val1": "2-val1", "val2": "2-val2" },
                 { "id": "3", "val1": "3-val1", "val2": "3-val2" },
                 { "id": "4", "val1": "4-val1", "val2": "4-val2" }]');

For just a handful of columns you might also pass an array for each column and loop through them in parallel. You can do this with a simple loop over the array index. Since Postgres 9.4 there is also the convenient unnest() with multiple parameters to do it all in a single query (a sketch follows the link below):

  • Unnest multiple arrays in parallel
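As a rough sketch of that per-column-array approach (foo_arr is a hypothetical name, same table1 as above):

CREATE FUNCTION foo_arr(_ids int[], _val1s text[], _val2s text[])
  RETURNS void AS
$$
   INSERT INTO table1 (id, val1, val2)
   SELECT t.id, t.val1, t.val2
   FROM   unnest(_ids, _val1s, _val2s) AS t(id, val1, val2)  -- zips the three arrays row by row
   ON     CONFLICT DO NOTHING;
$$ LANGUAGE sql;

The nth element of each array forms one row; the chained CTEs from the first part would slot into the same function in the same way.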

The best solution depends on the data format you have.


Erwin Brandstetter