Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Insert row if not exists leads to race condition?

I am implementing a simple web based RSS reader using python (not really relevant) and Postgresql (9.2 if relevant). The database schema is as follows (based on the RSS format):

CREATE TABLE feed_channel
(
    id SERIAL PRIMARY KEY,
    name TEXT,
    link TEXT NOT NULL,
    title TEXT
);
CREATE TABLE feed_content
(
    id SERIAL PRIMARY KEY,
    channel INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE,
    guid TEXT UNIQUE NOT NULL,
    title TEXT,
    link TEXT,
    description TEXT,
    pubdate TIMESTAMP
);

When I create a new channel (and also query for updated feed info) I request the feed, insert its data to the feed_channel table, selects the newly inserted ID - or existing to avoid duplicates - and then add the feed data to the feed_content table. A typical scenario would be:

  1. Query the feed url, grab feed headers and all current content
  2. Insert the feed headers into feed_channel if not exists... if already exists, grab the existing ID
  3. For each feed item, insert into the feed_content table with a reference to the stored channel ID

This is a standard "insert if not already exists, but return relevant ID" problem. To solve this I have implemented the following stored procedure:

CREATE OR REPLACE FUNCTION channel_insert(
  p_link feed_channel.link%TYPE,
  p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
  DECLARE
    v_id feed_channel.id%TYPE;
  BEGIN
    SELECT id
    INTO v_id
    FROM feed_channel
    WHERE link=p_link AND title=p_title
    LIMIT 1;

    IF v_id IS NULL THEN
      INSERT INTO feed_channel(name,link,title)
      VALUES (DEFAULT,p_link,p_title)
      RETURNING id INTO v_id;
    END IF;

    RETURN v_id;

  END;
$$ LANGUAGE plpgsql;

This is then called as "select channel_insert(link, title);" from my application to insert if not already exists and then return the ID of the relevant row regardless of whether it was inserted or just found (step 2 in list above).

This works great!

However, I recently started wondering what would happen if this procedure was executed twice at the same time with the same arguments. Lets assume the following:

  1. User 1 attempts to add a new channel and thereby execute channel_insert
  2. A few ms later, User 2 attempts to add the same channel and also execute channel_insert
  3. User 1's check for existing rows completes, but before the insert is complete, User 2's check completes and says there are no existing rows.

Will this be a potential race condition in PostgreSQL? What is the best way to solve this problem to avoid such scenarios? Is it possible to make the entire stored procedure atomically, i.e. that it can only be executed once at the same time?

One option I tried was to make the fields Unique and then attempt to insert first, and if exception, select the existing instead... This worked, however, the SERIAL field would increment for each attempt, leaving a lot of gaps in the sequence. I don't know if that would be a problem in the long run (probably not), but kind of annoying. Perhaps this is the preferred solution?

Thanks for any feedback. This level of PostgreSQL magic is beyond me, so any feedback would be appreciated.

like image 895
agnsaft Avatar asked Dec 21 '12 12:12

agnsaft


3 Answers

Will this be a potential race condition in PostgreSQL?

Yes, and in fact it would be in any database engine.

What is the best way to solve this problem to avoid such scenarios?

This is a loaded question and would require intimate knowledge of the usage of the database by multiple users. However, I'm going to give you some options. In short, the only option you have is to LOCK the table during this process, but how you lock that table will depend on how the database is used throughout the day.

Let's start with the basic LOCK:

LOCK TABLE feed_channel

That will lock the table using the ACCESS EXCLUSIVE lock option.

Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.

Now, that is the most restrictive lock available, and would certainly resolve the race condition, but may not be exactly what you want. That's something you're going to have to decide. So, though it's clear you're going to have to LOCK the table, it's not clear how.

What are you left to decide?

  1. How do you want to LOCK the table? Study the locking options at that link to make your determination.
  2. Where do you want to LOCK the table? Or in other words, do you want to LOCK at the top of the function (which I think you do based on the possible race condition), or do you simply want to LOCK right before the INSERT?

Is it possible to make the entire stored procedure atomically, i.e. that it can only be executed once at the same time?

No, the code can be executed by anybody that's connected to the database.


I hope that this has helped direct you.

like image 189
Mike Perrenoud Avatar answered Nov 14 '22 06:11

Mike Perrenoud


There is an unavoidable "race" here, since two sessions cannot "see" eachothers uncommited rows. On a conflict, a session could only rollback (maybe to a savepoint) and retry. That would typically mean: referring to the other's freshly inserted row, instead of creating a private duplicate.

There is a data-modelling problem here: feed_channel appears to have a lot of candidate keys, and the cascading rule from feed_content could orphanise a lot of feed_content's rows (I suppose content-> channel is 1::M relation; more than one contents-row could refer to the same channel)

Finally, the feed_channel table at least needs the natural key {link,title}. That is where the insert/not exists is all about. (and the whole purpose of this function)

I cleaned up the function a bit. The IF construct is not needed, doing an INSERT WHERE NOT EXISTS first works just as well, and maybe even better.

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE feed_channel
    ( id SERIAL PRIMARY KEY
    , name TEXT
    , link TEXT NOT NULL
    , title TEXT NOT NULL -- part of PK :: must be not nullable
    , CONSTRAINT feed_channel_nat UNIQUE (link,title) -- the natural key
);

CREATE TABLE feed_content
    ( id SERIAL PRIMARY KEY
    , channel INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE
    , guid TEXT UNIQUE NOT NULL -- yet another primary key
    , title TEXT --
    , link TEXT  -- title && link appear to be yet another candidate key
    , description TEXT
    , pubdate TIMESTAMP
    );

-- NOTE: omitted original function channel_insert() for brevity
CREATE OR REPLACE FUNCTION channel_insert_wp(
  p_link feed_channel.link%TYPE,
  p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $body$
   DECLARE
    v_id feed_channel.id%TYPE;
  BEGIN
      INSERT INTO feed_channel(link,title)
      SELECT p_link,p_title
      WHERE NOT EXISTS ( SELECT *
        FROM feed_channel nx
        WHERE nx.link= p_link
        AND nx.title= p_title
        )
        ;
    SELECT id INTO v_id
    FROM feed_channel ex
    WHERE ex.link= p_link
    AND ex.title= p_title
        ;

    RETURN v_id;

  END;
$body$ LANGUAGE plpgsql;

SELECT channel_insert('Bogus_link', 'Bogus_title');
SELECT channel_insert_wp('Bogus_link2', 'Bogus_title2');

SELECT * FROM feed_channel;

Results:

DROP SCHEMA
CREATE SCHEMA
SET
NOTICE:  CREATE TABLE will create implicit sequence "feed_channel_id_seq" for serial column "feed_channel.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "feed_channel_pkey" for table "feed_channel"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "feed_channel_nat" for table "feed_channel"
CREATE TABLE
NOTICE:  CREATE TABLE will create implicit sequence "feed_content_id_seq" for serial column "feed_content.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "feed_content_pkey" for table "feed_content"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "feed_content_guid_key" for table "feed_content"
CREATE TABLE
NOTICE:  type reference feed_channel.link%TYPE converted to text
NOTICE:  type reference feed_channel.title%TYPE converted to text
NOTICE:  type reference feed_channel.id%TYPE converted to integer
CREATE FUNCTION
NOTICE:  type reference feed_channel.link%TYPE converted to text
NOTICE:  type reference feed_channel.title%TYPE converted to text
NOTICE:  type reference feed_channel.id%TYPE converted to integer
CREATE FUNCTION
 channel_insert 
----------------
              1
(1 row)

 channel_insert_wp 
-------------------
                 2
(1 row)

 id | name |    link     |    title     
----+------+-------------+--------------
  1 |      | Bogus_link  | Bogus_title
  2 |      | Bogus_link2 | Bogus_title2
(2 rows)
like image 28
wildplasser Avatar answered Nov 14 '22 04:11

wildplasser


Your foremost problem is that a serial does not make a good primary key for the feed_channel table. The primary key should be (link, title) or just (link) if title can be null. Then any try to insert an existent feed would raise a primary key error.

BTW v_id will be null whenever title is null:

WHERE link=p_link AND title=p_title
like image 3
Clodoaldo Neto Avatar answered Nov 14 '22 04:11

Clodoaldo Neto