I have a MariaDB table users that looks roughly like this:
id INT PRIMARY KEY AUTO_INCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON
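Spelled out as DDL, it is roughly this (the index name is made up; the real definition may differ slightly):

CREATE TABLE users (
    id INT PRIMARY KEY AUTO_INCREMENT,
    email_hash INT,          -- short, deliberately lossy hash of the email
    encrypted_email TEXT,    -- non-deterministic encryption of the email
    other_stuff JSON,
    KEY idx_email_hash (email_hash)  -- plain, non-unique index
);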
For privacy reasons, I cannot store actual emails in the database.
The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap an index on the encrypted_email column, as it will never catch a duplicate.
There is already data in the database, and changing the encryption method or the hashing method is out of the question.
The email_hash column cannot have a unique index either, as it is supposed to be a short hash that merely speeds up duplicate checks. It cannot be too unique, as that would void all privacy guarantees.
How can I prevent two entries with the same email from appearing in the database?
Another limitation: I probably cannot use LOCK TABLES, as according to the documentation (https://mariadb.com/kb/en/library/lock-tables/):

LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.

LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.
(I do use Galera, and I do need transactions, as inserting a new user is accompanied by several other inserts and updates.)
Since the backend application server (a monolith) is allowed to handle personal information (for example sending email messages, verifying logins, and so on) as long as it doesn't store it, I do the duplicate check in the application.
Currently, I'm doing something like this (pseudocode):
perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts:
    if decrypt(conflict) == new_user.email:
        perform "ROLLBACK"
        return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK
This works fine if two attempts are separated in time. However, when two threads try to add the same user simultaneously, both transactions run in parallel, do the select, see no conflicting duplicate, and then both proceed to add the user. How can I prevent that, or at least recover gracefully and immediately?
This is what the race looks like, simplified:
Two threads start their transactions
Both threads do the select and the select returns zero rows in both cases.
Both threads assume there won't be a duplicate.
Both threads add the user.
Both threads commit the transactions.
There are now two users with the same email.
Tack FOR UPDATE on the end of the SELECT.
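In the question's pseudocode that is a one-line change (a sketch; everything else stays the same):

conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ? FOR UPDATE", h

The locking read makes the two transactions conflict on that email_hash index range instead of both sailing past the check: one of them blocks or fails (often with a deadlock error), and when it retries it sees the row the other one inserted.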
Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)
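On Galera such a conflict typically surfaces as a deadlock error (1213) at COMMIT time, so the whole transaction should be retried. A sketch in the same pseudocode style as the question (the retry limit and the error classification are assumptions):

for attempt in 1..MAX_RETRIES:
    perform "START TRANSACTION"
    run the duplicate check, the INSERT and the other inserts/updates as above
    result := perform "COMMIT"
    if result is success:
        return OK
    if result is a deadlock / certification-failure error:  -- how Galera reports conflicts with other nodes
        continue  -- retry; the duplicate check will now see the row committed elsewhere
    return result  -- some other error, propagate it
return ERROR  -- gave up after MAX_RETRIES attempts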