I call this my "Battle of the Nulls", for I have struggled with this issue for years now.
I have a large table (250,000+ rows, 100+ columns) named People
, and another called Stuff
, which may or may not contain a corresponding record. There are three columns which I can use to find possible matches: person ID, phone number or email address. These columns may or may not have values in them, or they may even contain nulls.
The original query I wrote years ago to do this was as follows:
SELECT *
FROM People
LEFT OUTER JOIN Stuff
ON People.PersonID = Stuff.PersonID
OR People.CellNumber = Stuff.PhoneNumber
OR People.Email = Stuff.WorkEmail;
When I first tried to run this query, it produced millions of records in the joined table, not at all what I expected. After days of troubleshooting, I finally determined that it was the presence of nulls and empty cells that was causing this enormous increase in the results. For those who might not know, PostgreSQL treats nulls and empty cells with the same respect as cells with data in them. The result is that it takes every record with an empty cell in the People table and joins it with every record in the Stuff table with an empty cell. It does the same with nulls, and for all three comparisons.
I searched for weeks and never did find an elegant or simple way around this, so I finally had to break it up into a series of separate queries as follows.
SELECT *
FROM People
LEFT OUTER JOIN Stuff
ON People.PersonID = Stuff.PersonID
WHERE (People.PersonID != ''
AND People.PersonID IS NOT NULL);
Dump the matched records into a temp table, then run the non-matched records through the second query:
SELECT *
FROM People
LEFT OUTER JOIN Stuff
ON People.CellNumber = Stuff.PhoneNumber
WHERE (People.CellNumber != ''
AND People.CellNumber IS NOT NULL);
Dump the matched records into the temp table, then run the remaining non-matched records through the third query:
SELECT *
FROM People
LEFT OUTER JOIN Stuff
ON People.Email = Stuff.WorkEmail
WHERE (People.Email != ''
AND People.Email IS NOT NULL);
Dump the result (both matched and non-matched) into the temp table, and move on.
I have been using this very non-elegant approach for years, and it has worked with no problems. But now I need to revamp this script to accommodate changes in business need, and I am trying to find once again a simpler solution. The problem in the current approach is that whenever I must make changes to the query, I must make them in multiple places in the code, which makes for a maintenance nightmare.
In this iteration, I have come up with the following:
SELECT *
FROM People
LEFT OUTER JOIN Stuff
ON (People.PersonID = Stuff.PersonID
WHERE People.PersonID != ''
AND People.PersonID IS NOT NULL)
OR (People.CellNumber = Stuff.PhoneNumber
WHERE People.CellNumber != ''
AND People.CellNumber IS NOT NULL)
OR (People.Email = Stuff.WorkEmail)
WHERE People.Email != ''
AND People.Email IS NOT NULL);
This looks like it should work, but it dies at the first WHERE
clause.
Am I on the right track here? How can I make this work? Or is there another approach that works much better?
There has got to be a way to run the original three-condition query in a way that does not match on nulls or empty values, but I have not yet found it.
Dog gone it! I AM gonna win this battle of the nulls! (With your help, of course!)
Postgres does not match "empty" cells to NULL
values. NULL
does not match to anything, using the typical comparison operators. However, an empty string will match an empty string.l
I suspect that you really want something like:
SELECT p.*, COALESCE(sp.?, sc.?, se.?) as ?
FROM People p LEFT OUTER JOIN
Stuff sp
ON p.PersonID = sp.PersonID LEFT OUTER JOIN
Stuff sc
ON p.CellNumber = sc.PhoneNumber AND sp.personID IS NULL LEFT OUTER JOIN
stuff se
ON p.Email = se.WorkEmail AND sc.personID is null;
This will take the first match from the three tables for each row in people
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With