Postgres 9.3: ShareLock issue with simple INSERT

Update: Potential solution below

I have a large corpus of configuration files consisting of key/value pairs that I'm trying to push into a database. Many of the keys and values are repeated across configuration files, so I'm storing the data using three tables: one for the unique keys, one for the unique values, and one listing the key/value pairs for each file.

Problem: I'm using multiple concurrent processes (and therefore connections) to add the raw data into the database. Unfortunately I get a lot of deadlocks when trying to add values to the key and value tables. I have tried a few different methods of inserting the data (shown below), but always end up with a "deadlock detected" error:

TransactionRollbackError: deadlock detected
DETAIL: Process 26755 waits for ShareLock on transaction 689456; blocked by process 26754. Process 26754 waits for ShareLock on transaction 689467; blocked by process 26755.

I was wondering if someone could shed some light on exactly what could be causing these deadlocks, and possibly point me towards some way of fixing the issue. Looking at the SQL statements I'm using (listed below), I don't really see why there is any co-dependency at all.

Thanks for reading!

Example config file:

example_key this_is_the_value
other_example other_value
third example yet_another_value

Table definitions:

    CREATE TABLE keys (
        id SERIAL PRIMARY KEY,
        hash UUID UNIQUE NOT NULL,
        key TEXT);

    CREATE TABLE values (
        id SERIAL PRIMARY KEY,
        hash UUID UNIQUE NOT NULL,
        value TEXT);

    CREATE TABLE keyvalue_pairs (
        id SERIAL PRIMARY KEY,
        file_id INTEGER REFERENCES filenames,
        key_id INTEGER REFERENCES keys,
        value_id INTEGER REFERENCES values);

SQL Statements:

Initially I was trying to use this statement to avoid any exceptions:

    WITH s AS (
        SELECT id, hash, key FROM keys
            WHERE hash = 'hash_value'
    ), i AS (
        INSERT INTO keys (hash, key)
        SELECT 'hash_value', 'key_value'
        WHERE NOT EXISTS (SELECT 1 FROM s)
        RETURNING id, hash, key
    )
    SELECT id, hash, key FROM i
    UNION ALL
    SELECT id, hash, key FROM s;
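
Note that this statement is not atomic under concurrency: two sessions can each see an empty s and both attempt the INSERT, at which point the second blocks on the first's uncommitted unique-index entry (the ShareLock wait in the error above) and then fails with a unique violation once the first commits, so the savepoint fallback described below is still needed. For reference, here is how the statement might be run from psycopg2, parameterized instead of using literals (the wrapper name is just for illustration, and the result is trimmed to the id):

    def get_or_create_key(cur, hash_val, key_value):
        # Same statement as above, with placeholders; still NOT atomic
        # under concurrency, so callers need the savepoint fallback.
        cur.execute(
            '''
            WITH s AS (
                SELECT id FROM keys WHERE hash = %(hash)s
            ), i AS (
                INSERT INTO keys (hash, key)
                SELECT %(hash)s, %(key)s
                WHERE NOT EXISTS (SELECT 1 FROM s)
                RETURNING id
            )
            SELECT id FROM i
            UNION ALL
            SELECT id FROM s;
            ''',
            {'hash': hash_val, 'key': key_value}
        )
        return cur.fetchone()[0]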

But even something as simple as this causes the deadlocks:

    INSERT INTO keys (hash, key)
        VALUES ('hash_value', 'key_value')
        RETURNING id;
  • In both cases, if an exception is thrown because the inserted hash value is not unique, I use savepoints to roll back the change and another statement to just select the id I'm after.
  • I'm using hashes for the unique field, as some of the keys and values are too long to be indexed.

Full example of the Python code (using psycopg2) with savepoints:

    key_value = 'this_key'
    hash_val = generate_uuid(key_value)
    try:
        cursor.execute(
            '''
            SAVEPOINT duplicate_hash_savepoint;
            INSERT INTO keys (hash, key)
                VALUES (%s, %s)
                RETURNING id;
            ''',
            (hash_val, key_value)
        )

        result = cursor.fetchone()[0]
        cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint;''')
        return result
    except psycopg2.IntegrityError:
        cursor.execute(
            '''
            ROLLBACK TO SAVEPOINT duplicate_hash_savepoint;
            '''
        )

        #TODO: Should ensure that values match and this isn't just
        #a hash collision

        cursor.execute(
            '''
            SELECT id FROM keys WHERE hash=%s LIMIT 1;
            ''',
            (hash_val,)
        )
        return cursor.fetchone()[0]
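
generate_uuid isn't shown here; any deterministic text-to-UUID mapping will do for the unique hash column, for example a name-based (version 5) UUID along these lines (a sketch, not necessarily the original helper):

    import uuid

    def generate_uuid(text):
        # uuid5 is deterministic: the same text always yields the same
        # UUID, which is what the UNIQUE hash column relies on
        return str(uuid.uuid5(uuid.NAMESPACE_OID, text))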

Update: So I believe I found a hint on another Stack Exchange site:

Specifically:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.

While I'm still not exactly sure where the co-dependency is, it seems that processing a large number of key/value pairs without committing would likely result in something like this. Sure enough, if I commit after each individual configuration file is added, the deadlocks don't occur.
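
Roughly, the loading loop now commits once per configuration file rather than once at the very end. A sketch, where parse_pairs and get_or_create_value are assumed helpers (the latter analogous to the key upsert shown above):

    def load_config_file(conn, file_id, path):
        cur = conn.cursor()
        for key, value in parse_pairs(path):  # assumed: yields (key, value)
            key_id = get_or_create_key(cur, generate_uuid(key), key)
            value_id = get_or_create_value(cur, generate_uuid(value), value)
            cur.execute(
                '''INSERT INTO keyvalue_pairs (file_id, key_id, value_id)
                   VALUES (%s, %s, %s);''',
                (file_id, key_id, value_id)
            )
        # commit after every file: transactions stay short, so concurrent
        # workers rarely wait on each other's uncommitted rows
        conn.commit()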

Asked by JBeFat

1 Answer

It looks like you're in this situation:

  1. The table to INSERT into has a primary key (or unique index(es) of any sort).
  2. Several INSERTs into that table are performed within one transaction (as opposed to committing immediately after each one).
  3. The rows to insert come in random order (with regard to the primary key).
  4. The rows are inserted in concurrent transactions.

This situation creates the following opportunity for deadlock:

Assume there are two sessions, each of which has started a transaction:

  1. Session #1: insert row with PK 'A'
  2. Session #2: insert row with PK 'B'
  3. Session #1: try to insert row with PK 'B' => Session #1 is put to wait until Session #2 commits or rolls back
  4. Session #2: try to insert row with PK 'A' => Session #2 is put to wait for Session #1.

Shortly thereafter, the deadlock detector becomes aware that both sessions are now waiting for each other, and terminates one of them with a fatal "deadlock detected" error.
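
To see this concretely, the following minimal sketch reproduces the interleaving with two threads (the table name t and the connection string are placeholders; run it against a scratch database):

    import threading
    import psycopg2
    from psycopg2.extensions import TransactionRollbackError

    DSN = 'dbname=scratch'  # placeholder: point at a throwaway database

    def session(first, second, barrier):
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        cur.execute('INSERT INTO t VALUES (%s);', (first,))  # steps 1 and 2
        barrier.wait()  # both sessions now hold the lock on their first row
        try:
            # steps 3 and 4: each session blocks on the other's uncommitted row
            cur.execute('INSERT INTO t VALUES (%s);', (second,))
            conn.commit()
        except TransactionRollbackError as e:
            print(e)  # the deadlock detector terminates one of the sessions
            conn.rollback()
        conn.close()

    setup = psycopg2.connect(DSN)
    cur = setup.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS t (id int PRIMARY KEY);')
    cur.execute('TRUNCATE t;')
    setup.commit()

    barrier = threading.Barrier(2)
    t1 = threading.Thread(target=session, args=(1, 2, barrier))
    t2 = threading.Thread(target=session, args=(2, 1, barrier))
    t1.start(); t2.start()
    t1.join(); t2.join()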

If you're in this scenario, the simplest solution is to COMMIT after each new entry is inserted, before attempting to insert the next row into the table.
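
If committing that often costs too much throughput, an alternative (not from the original answer, just a common pattern) is to keep the batching but make every transaction acquire its locks in the same order, by sorting each batch on the unique column before inserting:

    def insert_batch_sorted(cur, pairs):
        # pairs is assumed to be a list of (hash_val, key_value) tuples.
        # With a single global insert order there can be no circular wait,
        # hence no deadlock between concurrent batches.
        for hash_val, key_value in sorted(pairs):
            cur.execute(
                'INSERT INTO keys (hash, key) VALUES (%s, %s);',
                (hash_val, key_value)
            )

Duplicate hashes can still raise unique violations, so the savepoint handling from the question still applies.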

Answered by Daniel Vérité