Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL: Find rows that match closely but not exactly

Tags:

sql

postgresql

I have a table inside a PostgreSQL database with columns c1,c2...cn. I want to run a query that compares each row against a tuple of values v1,v2...vn. The query should not return an exact match but should return a list of rows ordered in descending similarity to the value vector v.

Example:

The table contains sports records:

1,USA,basketball,1956
2,Sweden,basketball,1998
3,Sweden,skating,1998
4,Switzerland,golf,2001

Now when I run a query against this table with v=(Sweden,basketball,1998), I want to get all records that have a similarity with this vector, sorted by number of matching columns in descending order:

2,Sweden,basketball,1998   --> 3 columns match
3,Sweden,skating,1998  --> 2 columns match
1,USA,basketball,1956  --> 1 column matches

Row 4 is not returned because it does not match at all.

Edit: All columns are equally important. Although, when I really think of it... it would be a nice add-on if I could give each column a different weight factor as well.

Is there any possible SQL query that would return the rows in a reasonable amount of time, even when I run it against a million rows?

What would such a query look like?

like image 637
Matthias Bohlen Avatar asked Oct 19 '22 04:10

Matthias Bohlen


2 Answers

SELECT * FROM countries

WHERE country = 'sweden'
OR sport = 'basketball'
OR year = 1998

ORDER BY
cast(country = 'sweden' AS integer) +
cast(sport = 'basketball' as integer) + 
cast(year  = 1998 as integer) DESC

It's not beautiful, but well. You can cast the boolean expressions as integers and sum them.

You can easily change the weight, by adding a multiplicator.

cast(sport = 'basketball' as integer) * 5 + 
like image 92
Philipp Avatar answered Oct 20 '22 23:10

Philipp


This is how I would do it ... the multiplication factors used in the case stmts will handle the importance(weight) of the match and they will ensure that those records that have matches for columns designated with the highest weight will come up top even if the other columns don't match for those particular records.

/*
-- Initial Setup 

-- drop table sport 
create table sport (id int, Country varchar(20) , sport varchar(20) , yr int )

insert into sport values 
(1,'USA','basketball','1956'),
(2,'Sweden','basketball','1998'),
(3,'Sweden','skating','1998'),
(4,'Switzerland','golf','2001')

select * from sport
*/


select * , 
        CASE WHEN Country='sweden'      then 1 else 0 end * 100 + 
        CASE WHEN sport='basketball'    then 1 else 0 end * 10  + 
        CASE WHEN yr=1998               then 1 else 0 end * 1       as Match
from sport
WHERE 
   country = 'sweden'
OR sport   = 'basketball'
OR yr      = 1998
ORDER BY Match Desc
like image 41
objectNotFound Avatar answered Oct 20 '22 21:10

objectNotFound