Preventing insertion of duplicates without using indices

Tags: sql, indexing, mysql, mariadb

I have a MariaDB table users that looks roughly like this:

id INT PRIMARY KEY AUTO_INCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON

For privacy reasons, I cannot store actual emails in the database.

The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap a unique index on the encrypted_email column, as it would never catch a duplicate.

There is already data in the database, and changing the encryption method or the hashing method is out of the question.

The email_hash column cannot have a unique index either, as it is supposed to be a short hash that just speeds up duplicate checks. It cannot be too unique, as that would void all privacy guarantees.
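To illustrate why a unique index on email_hash cannot work, here is a minimal sketch of a deliberately short hash. The real hashing method is not given in the question; truncating SHA-256 to 16 bits, and the names short_hash, find_collision and HASH_BITS, are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict, Tuple

HASH_BITS = 16  # assumed width; the real email_hash width is not stated


def short_hash(email: str, bits: int = HASH_BITS) -> int:
    """Map an email to a small bucket number (hypothetical scheme)."""
    digest = hashlib.sha256(email.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def find_collision(bits: int = HASH_BITS) -> Tuple[str, str]:
    """Brute-force two different addresses that share one bucket."""
    seen: Dict[int, str] = {}
    i = 0
    while True:
        email = f"user{i}@example.com"
        h = short_hash(email, bits)
        if h in seen:
            return seen[h], email
        seen[h] = email
        i += 1
```

With so few buckets, distinct addresses are bound to share a hash value, so a unique index on that column would reject legitimate new users. The hash can only narrow down the candidates that then have to be decrypted and compared.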

How can I prevent two entries with the same email from appearing in the database?

Another limitation: I probably cannot use LOCK TABLES, because according to the documentation (https://mariadb.com/kb/en/library/lock-tables/):

LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.

LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.

(I do use Galera, and I do need transactions, as inserting a new user is accompanied by several other inserts and updates.)

Since the backend application server (a monolith) is allowed to handle personal information (for example for sending email messages, verifying logins etc.) as long as it doesn't store it, I do the duplicate check in the application.

Currently, I'm doing something like this (pseudocode):

perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts :
    if decrypt(conflict) == new_user.email :
        perform "ROLLBACK"
        return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK

This works fine if the two attempts are separated in time. However, when two threads try to add the same user simultaneously, both transactions run in parallel, do the select, see no conflicting duplicate, and then both proceed to add the user. How can I prevent that, or at least recover gracefully and immediately?

This is what the race looks like, simplified:

Two threads start their transactions.
Both threads do the select and the select returns zero rows in both cases.
Both threads assume there won't be a duplicate.
Both threads add the user.
Both threads commit the transactions.
There are now two users with the same email.
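The interleaving above can be reproduced deterministically outside the database. The following is only a toy model: a Python list stands in for the users table, plain strings stand in for encrypted rows, and a threading.Barrier forces both threads to finish their check before either one inserts. The names users, register and run_race are made up for this sketch.

```python
import threading

users = []                      # stands in for the users table
barrier = threading.Barrier(2)  # holds both threads until each has done its check


def register(email: str) -> None:
    # The SELECT plus decrypt-and-compare step: look for an existing row.
    duplicate = any(row == email for row in users)
    # Both threads arrive here having seen no duplicate.
    barrier.wait()
    if not duplicate:
        users.append(email)     # the INSERT


def run_race() -> list:
    """Run two concurrent registrations of the same address."""
    users.clear()
    threads = [
        threading.Thread(target=register, args=("a@example.com",))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(users)
```

Calling run_race() returns a list containing the same address twice, because neither check can observe the other thread's insert.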

asked Oct 04 '19 19:10

Karol S

1 Answer

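Tack FOR UPDATE onto the end of the SELECT, so that the duplicate check becomes a locking read: SELECT encrypted_email FROM users WHERE email_hash = ? FOR UPDATE.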

Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)
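Applied to the pseudocode from the question, the two suggestions could look roughly like the sketch below. It is not a definitive implementation. It assumes a Python DB-API 2.0 connection with autocommit turned off and the "?" placeholder style used in the question (placeholder syntax differs between drivers). The function add_user and its parameters hash_fn, encrypt, decrypt, retry_on and max_attempts are hypothetical; retry_on must be set to whatever exception classes your driver raises for deadlocks and failed commits.

```python
def add_user(conn, email, other_stuff, hash_fn, encrypt, decrypt,
             retry_on, max_attempts=3):
    """Insert a user unless the email already exists.

    Returns "OK" or "DUPLICATE". Retries the whole transaction when the
    driver raises one of the exception classes given in retry_on.
    """
    for _ in range(max_attempts):
        cur = conn.cursor()
        try:
            h = hash_fn(email)
            # Locking read, as suggested in the answer.
            cur.execute(
                "SELECT encrypted_email FROM users "
                "WHERE email_hash = ? FOR UPDATE",
                (h,),
            )
            for (encrypted,) in cur.fetchall():
                if decrypt(encrypted) == email:
                    conn.rollback()
                    return "DUPLICATE"
            cur.execute(
                "INSERT INTO users (email_hash, encrypted_email, other_stuff) "
                "VALUES (?, ?, ?)",
                (h, encrypt(email), other_stuff),
            )
            # The other inserts and updates of this transaction go here.
            # Per the answer, Galera reports conflicts with other nodes at COMMIT.
            conn.commit()
            return "OK"
        except retry_on:
            # Undo the failed attempt; the next pass repeats the duplicate check.
            conn.rollback()
        finally:
            cur.close()
    raise RuntimeError("add_user: giving up after repeated conflicts")
```

The retry loop is what gives the graceful recovery asked for: if a competing transaction won, the repeated SELECT finds its row and the function returns "DUPLICATE" instead of inserting a second copy.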

answered Sep 17 '22 22:09

Rick James
