Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Preventing insertion of duplicates without using indices

I have a MariaDB table users that looks roughly like this:

id INT PRIMARY KEY AUTOINCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON

For privacy reasons, I cannot store actual emails in the database.

The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap an index on the encrypted_email column, as it will never catch a duplicate.

There are already data in the database and changing the encryption method or the hashing method is out of question.

The email_hash column cannot have a unique index either, as it is supposed to be a short hash to just speed up duplicate checks. It cannot be too unique, as it would void all privacy guarantees.

How can I prevent two entries with the same email from appearing in the database?

Another limitation: I probably cannot use LOCK TABLE, as according to the documentation https://mariadb.com/kb/en/library/lock-tables/

LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.

LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.

(I do use Galera and I do need transactions as inserting a new user is accompanied with several other inserts and updates)


Since the backend application server (a monolith) is allowed to handle personal information (for example for sending email messages, verifying logins etc.) as long as it doesn't store it, I do the duplicate check in the application.

Currently, I'm doing something like this (pseudocode):

perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts :
    if decrypt(conflict) == new_user.email :
        perform "ROLLBACK"
        return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK

which works fine if two attempts are separated in time. However, when two threads try to add the same user simultaneously, then both transactions run in parallel, do the select, see no conflicting duplicate, and then both proceed to add the user. How to prevent that, or at least gracefully immediately recover?


This is how the race looks like, simplified:

  • Two threads start their transactions

  • Both threads do the select and the select returns zero rows in both cases.

  • Both threads assume there won't be a duplicate.

  • Both threads add the user.

  • Both threads commit the transactions.

  • There are now two users with the same email.

like image 366
Karol S Avatar asked Oct 04 '19 19:10

Karol S


People also ask

How can we prevent insertion of duplicate data?

You can use a PRIMARY KEY or a UNIQUE Index on a table with the appropriate fields to stop duplicate records. Let us take an example – The following table contains no such index or primary key, so it would allow duplicate records for first_name and last_name.

Which key is used to avoid duplicate values?

You can prevent duplicate key values in your files by specifying the UNIQUE keyword in data description specifications (DDS). With the UNIQUE keyword specified, a record cannot be entered or copied into a file if its key value is the same as the key value of a record already existing in the file.


1 Answers

Tack FOR UPDATE on the end of the SELECT.

Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)

like image 79
Rick James Avatar answered Sep 17 '22 22:09

Rick James