
How to avoid duplicate key exceptions efficiently

I'm facing the following problem:

I'm trying to keep a table in SQL Server synchronized with multiple external databases. These external databases do not share a common unique primary key, so the local table has its own simple integer PK.

To keep the local table up to date, the following is done:

  1. The external databases are queried.
  2. The data is converted into valid rows for the local table.
  3. An INSERT is attempted against the local table.
  4. If the INSERT fails with a duplicate key exception, the existing row's PK is looked up with a SELECT and the data is written with an UPDATE instead.
  5. Another table is modified using the PK of the inserted or updated row.

This works, but it seems very inefficient to me. Most of the time the data is already in the local table, so the INSERT fails with a duplicate key exception. That means lots of exceptions to handle, which is expensive. And because the PKs are managed by the DB, a separate SELECT is needed to find the row to update.
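
Roughly, what I'm doing today looks like this (a simplified sketch; the table, column, and parameter names are placeholders, not my real schema):

// Current approach: try the INSERT first, fall back to SELECT + UPDATE on a duplicate key error.
// Table and column names are placeholders.
int UpsertViaException(SqlConnection conn, string sourceId, string externalKey, string payload)
{
    try
    {
        using (var insert = new SqlCommand(
            @"INSERT INTO dbo.LocalTable (SourceId, ExternalKey, Payload)
              VALUES (@src, @key, @payload);
              SELECT CAST(SCOPE_IDENTITY() AS int);", conn))
        {
            insert.Parameters.AddWithValue("@src", sourceId);
            insert.Parameters.AddWithValue("@key", externalKey);
            insert.Parameters.AddWithValue("@payload", payload);
            return (int)insert.ExecuteScalar();   // PK of the new row
        }
    }
    catch (SqlException ex)
    {
        if (ex.Number != 2601 && ex.Number != 2627) throw;   // not a duplicate key error

        // Row already exists: look up its PK, then update it.
        int id;
        using (var select = new SqlCommand(
            "SELECT Id FROM dbo.LocalTable WHERE SourceId = @src AND ExternalKey = @key", conn))
        {
            select.Parameters.AddWithValue("@src", sourceId);
            select.Parameters.AddWithValue("@key", externalKey);
            id = (int)select.ExecuteScalar();
        }
        using (var update = new SqlCommand(
            "UPDATE dbo.LocalTable SET Payload = @payload WHERE Id = @id", conn))
        {
            update.Parameters.AddWithValue("@payload", payload);
            update.Parameters.AddWithValue("@id", id);
            update.ExecuteNonQuery();
        }
        return id;   // PK of the existing row
    }
}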

How can I avoid this? I'd rather not use a stored procedure, as I like to keep the queries in code so they stay manageable and under version control.

I've looked at MERGE, but I've seen too many people reporting issues with it.

I think I need some form of upsert, but I'm not sure how to do this with the PKs being managed by the DB.

tl;dr: What I need is a query that will either insert or update a row (depending on whether the key already exists) and that will always return the PK of that row.
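
Something along these lines is the rough shape I imagine, using the same placeholder names as above, but I'm not sure it's correct or safe with concurrent writers, which is part of why I'm asking:

// Hypothetical single round-trip upsert that returns the PK either way (placeholder names).
int Upsert(SqlConnection conn, string sourceId, string externalKey, string payload)
{
    const string sql = @"
        DECLARE @id int;

        UPDATE dbo.LocalTable
        SET Payload = @payload, @id = Id
        WHERE SourceId = @src AND ExternalKey = @key;

        IF @id IS NULL
        BEGIN
            INSERT INTO dbo.LocalTable (SourceId, ExternalKey, Payload)
            VALUES (@src, @key, @payload);
            SET @id = SCOPE_IDENTITY();
        END

        SELECT @id;";

    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@src", sourceId);
        cmd.Parameters.AddWithValue("@key", externalKey);
        cmd.Parameters.AddWithValue("@payload", payload);
        return (int)cmd.ExecuteScalar();   // PK whether the row was inserted or updated
    }
}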

asked by Mike van Leeuwen


1 Answer

I have an implementation I've used in the past that I like. You may or may not find it useful.

This is how it works... I load both external and local data into memory using a model object that will work for both. For example...

using System.Collections.Generic;

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string PhoneNumber { get; set; }
    public string Address { get; set; }

    // This comparer will be used to find records that exist or don't exist.
    public class KeyFieldComparer : IEqualityComparer<Person>
    {
        public bool Equals(Person p1, Person p2)
        {
            return p1.FirstName == p2.FirstName && p1.LastName == p2.LastName;
        }

        public int GetHashCode(Person p)
        {
            return p.FirstName.GetHashCode() ^ p.LastName.GetHashCode();
        }
    }

    // This comparer will be used to find records that are outdated and need to be updated.
    public class OutdatedComparer : IEqualityComparer<Person>
    {
        public bool Equals(Person p1, Person p2)
        {
            return p1.FirstName == p2.FirstName && p1.LastName == p2.LastName && (p1.PhoneNumber != p2.PhoneNumber || p1.Address != p2.Address);
        }

        public int GetHashCode(Person p)
        {
            return p.FirstName.GetHashCode() ^ p.LastName.GetHashCode();
        }
    }
}

We need some way to uniquely identify the records, which I assume you have. In this example it's by FirstName and LastName (I know that's not actually unique, but for simplicity let's pretend it works). The IEqualityComparer<> implementations do the work of finding the outdated and new records once the lists are loaded into memory.

Now we simply separate existing outdated records and brand new records like this...

List<Person> local = loadLocalRecords();
List<Person> external = loadExternalRecords();

var newRecordsToInsert = external.Except(local, new Person.KeyFieldComparer());

var outdatedRecordsToUpdate = local.Intersect(external, new Person.OutdatedComparer());
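
If you then need to write the differences back and capture the PKs for your second table, that last step could look something like this (a rough sketch, assuming conn is an open SqlConnection; dbo.People, its columns, and the int Id identity PK are just stand-ins for your real schema). One thing to note: outdatedRecordsToUpdate holds the local rows, so the fresh values have to come from the matching external record.

// Write-back sketch with placeholder table/column names.
var externalByKey = external.ToDictionary(p => Tuple.Create(p.FirstName, p.LastName));

foreach (Person stale in outdatedRecordsToUpdate)
{
    Person fresh = externalByKey[Tuple.Create(stale.FirstName, stale.LastName)];
    using (var update = new SqlCommand(
        @"UPDATE dbo.People SET PhoneNumber = @ph, Address = @ad
          WHERE FirstName = @fn AND LastName = @ln;
          SELECT Id FROM dbo.People WHERE FirstName = @fn AND LastName = @ln;", conn))
    {
        update.Parameters.AddWithValue("@ph", fresh.PhoneNumber);
        update.Parameters.AddWithValue("@ad", fresh.Address);
        update.Parameters.AddWithValue("@fn", stale.FirstName);
        update.Parameters.AddWithValue("@ln", stale.LastName);
        int updatedId = (int)update.ExecuteScalar();   // PK of the updated row
    }
}

foreach (Person p in newRecordsToInsert)
{
    using (var insert = new SqlCommand(
        @"INSERT INTO dbo.People (FirstName, LastName, PhoneNumber, Address)
          VALUES (@fn, @ln, @ph, @ad);
          SELECT CAST(SCOPE_IDENTITY() AS int);", conn))
    {
        insert.Parameters.AddWithValue("@fn", p.FirstName);
        insert.Parameters.AddWithValue("@ln", p.LastName);
        insert.Parameters.AddWithValue("@ph", p.PhoneNumber);
        insert.Parameters.AddWithValue("@ad", p.Address);
        int newId = (int)insert.ExecuteScalar();   // PK of the new row
    }
}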

I hope that makes sense; I can answer questions if you have them. The good thing about this method is that it does the job with the fewest hits to the database (I think). The bad thing is that it has to load everything into memory, which may not be practical for you, but the table would have to be quite large for that to be a problem, somewhere above a few million records depending on how many columns you have.

answered by Theo