Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL JOIN without matching NULL values

I call this my "Battle of the Nulls", for I have struggled with this issue for years now.

I have a large table (250,000+ rows, 100+ columns) named People, and another called Stuff, which may or may not contain a corresponding record. There are three columns which I can use to find possible matches: person ID, phone number or email address. These columns may or may not have values in them, or they may even contain nulls.

The original query I wrote years ago to do this was as follows:

SELECT *
  FROM People
  LEFT OUTER JOIN Stuff
    ON People.PersonID = Stuff.PersonID
    OR People.CellNumber = Stuff.PhoneNumber
    OR People.Email = Stuff.WorkEmail;

When I first tried to run this query, it produced millions of records in the joined table, not at all what I expected. After days of troubleshooting, I finally determined that it was the presence of nulls and empty cells that was causing this enormous increase in the results. For those who might not know, PostgreSQL treats nulls and empty cells with the same respect as cells with data in them. The result is that it takes every record with an empty cell in the People table and joins it with every record in the Stuff table with an empty cell. It does the same with nulls, and for all three comparisons.

I searched for weeks and never did find an elegant or simple way around this, so I finally had to break it up into a series of separate queries as follows.

SELECT *
FROM People
    LEFT OUTER JOIN Stuff
      ON People.PersonID = Stuff.PersonID
    WHERE (People.PersonID != ''
      AND People.PersonID IS NOT NULL);

Dump the matched records into a temp table, then run the non-matched records through the second query:

SELECT *
FROM People
    LEFT OUTER JOIN Stuff
      ON People.CellNumber = Stuff.PhoneNumber
    WHERE (People.CellNumber != ''
      AND People.CellNumber IS NOT NULL);

Dump the matched records into the temp table, then run the remaining non-matched records through the third query:

SELECT *
FROM People
    LEFT OUTER JOIN Stuff
      ON People.Email = Stuff.WorkEmail
    WHERE (People.Email != ''
      AND People.Email IS NOT NULL);

Dump the result (both matched and non-matched) into the temp table, and move on.

I have been using this very non-elegant approach for years, and it has worked with no problems. But now I need to revamp this script to accommodate changes in business need, and I am trying to find once again a simpler solution. The problem in the current approach is that whenever I must make changes to the query, I must make them in multiple places in the code, which makes for a maintenance nightmare.

In this iteration, I have come up with the following:

SELECT *
  FROM People
  LEFT OUTER JOIN Stuff
    ON (People.PersonID = Stuff.PersonID
        WHERE People.PersonID != ''
          AND People.PersonID IS NOT NULL)
    OR (People.CellNumber = Stuff.PhoneNumber
        WHERE People.CellNumber != ''
          AND People.CellNumber IS NOT NULL)
    OR (People.Email = Stuff.WorkEmail)
        WHERE People.Email != ''
          AND People.Email IS NOT NULL);

This looks like it should work, but it dies at the first WHERE clause.

Am I on the right track here? How can I make this work? Or is there another approach that works much better?

There has got to be a way to run the original three-condition query in a way that does not match on nulls or empty values, but I have not yet found it.

Dog gone it! I AM gonna win this battle of the nulls! (With your help, of course!)

like image 366
Michael Sheaver Avatar asked Dec 11 '22 07:12

Michael Sheaver


1 Answers

Postgres does not match "empty" cells to NULL values. NULL does not match to anything, using the typical comparison operators. However, an empty string will match an empty string.l

I suspect that you really want something like:

SELECT p.*, COALESCE(sp.?, sc.?, se.?) as ?
FROM People p LEFT OUTER JOIN
     Stuff sp
     ON p.PersonID = sp.PersonID LEFT OUTER JOIN
     Stuff sc
     ON p.CellNumber = sc.PhoneNumber AND sp.personID IS NULL LEFT OUTER JOIN
     stuff se
     ON p.Email = se.WorkEmail AND sc.personID is null;

This will take the first match from the three tables for each row in people.

like image 81
Gordon Linoff Avatar answered Dec 23 '22 00:12

Gordon Linoff