
How to find similar values in one column with postgresql

Tags: sql, postgresql


I’m a complete newbie in SQL, so I’m not very familiar with its functionality.
Here is my problem.
I have the following table with more than 100,000 companies (let's call it 'comp'):

id  | title               | name
----+---------------------+--------------
1   | XYZ                 | xyz
----+---------------------+--------------
2   | Smarts              | smarts
----+---------------------+--------------
3   | XYZ LTD             | xyzltd
----+---------------------+--------------
4   | Outsmarts           | outsmarts
----+---------------------+--------------
5   | XYZ Entertainment   | xyzentertainment
----+---------------------+--------------
6   | Smarts Entertainment| smartsentertainment

where 'title' is the company name and 'name' is the same title lowercased and with spaces removed. Is there a way to find all companies with similar titles (using either 'title' or 'name')? Basically, I want to get:

id  | title               | name
----+---------------------+--------------
1   | XYZ                 | xyz
----+---------------------+--------------
3   | XYZ LTD             | xyzltd
----+---------------------+--------------
5   | XYZ Entertainment   | xyzentertainment
----+---------------------+--------------
2   | Smarts              | smarts
----+---------------------+--------------
6   | Smarts Entertainment| smartsentertainment

By similar I mean:
1) 'XYZ', 'XYZ LTD' and 'XYZ Entertainment'
2) 'Smarts' and 'Smarts Entertainment'
but 'XYZ Entertainment' is not similar to 'Smarts Entertainment', and 'Smarts' is not similar to 'Outsmarts'.

I tried this and it didn't work:

SELECT set_limit(0.8);

SELECT
  similarity(c1.name, c2.name) AS sim,
  c1.name,
  c2.name
FROM comp AS c1
  JOIN comp AS c2
    ON c1.name != c2.name
       AND c1.name % c2.name
ORDER BY sim DESC;

By 'didn't work' I mean that after 7 minutes it still hadn't returned any results. I assume I totally messed it up.
Is it even possible to retrieve such similarities?
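
For reference, a minimal sketch of how the trigram query above could be inspected and narrowed, assuming the pg_trgm extension is installed and 'comp' has the columns shown. The % operator only matches pairs whose trigram similarity reaches the current limit, and without a trigram index roughly 10^10 row pairs have to be scored. The index name comp_name_trgm_idx and the 0.3 limit (pg_trgm's default) are illustrative choices, not values from the question.

```sql
-- Assumption: pg_trgm is available (the % operator and similarity() come from it).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Inspect the trigrams extracted from two of the sample names.
SELECT show_trgm('xyz') AS short_name, show_trgm('xyzltd') AS long_name;

-- Score a single pair directly, without touching the table.
SELECT similarity('xyz', 'xyzltd') AS sim;

-- Hypothetical index name. A GIN trigram index may let the planner answer
-- the % condition from the index instead of scoring every pair of rows.
CREATE INDEX IF NOT EXISTS comp_name_trgm_idx
  ON comp USING gin (name gin_trgm_ops);

SELECT set_limit(0.3);  -- illustrative; 0.3 is the pg_trgm default

SELECT
  similarity(c1.name, c2.name) AS sim,
  c1.name,
  c2.name
FROM comp AS c1
  JOIN comp AS c2
    ON c1.id < c2.id          -- report each pair once and skip self-pairs
       AND c1.name % c2.name
ORDER BY sim DESC;
```

With these sample names, 'xyz' and 'xyzltd' share 3 trigrams out of 8 distinct ones, which is well under 0.8, so a limit that strict may hide exactly the pairs wanted here. Trigram scores can also rate 'smarts' and 'outsmarts' as fairly close, so any limit would need checking against the real data.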

L.Viek asked Oct 29 '22 16:10


1 Answer

You could try the levenshtein() function (from the fuzzystrmatch extension), which returns the number of single-character edits needed to turn the first argument into the second:

SELECT levenshtein(c1.name, c2.name) AS dist, c1.name, c2.name
FROM comp AS c1
  JOIN comp AS c2 ON c1.name != c2.name
ORDER BY dist;  -- smallest distance (most similar) first
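
A minimal sketch of how the query above might be narrowed, assuming the fuzzystrmatch extension can be installed. The cutoff of 3 edits is an arbitrary illustration, not a recommended value, and joining on id rather than name is only there to report each pair once.

```sql
-- levenshtein() and levenshtein_less_equal() are provided by fuzzystrmatch.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT
  levenshtein(c1.name, c2.name) AS dist,
  c1.name,
  c2.name
FROM comp AS c1
  JOIN comp AS c2
    ON c1.id < c2.id  -- report each pair once and skip self-pairs
WHERE levenshtein_less_equal(c1.name, c2.name, 3) <= 3  -- illustrative cutoff
ORDER BY dist;
```

This still compares every pair of rows, so it may remain slow on 100,000 companies. Edit distance also grows with the difference in length: 'xyz' and 'xyzentertainment' are 13 edits apart, so a small cutoff will not group them.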
clemens answered Nov 15 '22 05:11