PostgreSQL, trigrams and similarity

Tags:

Just testing PostgreSQL 9.6.2 on my Mac and playing with Ngrams. Assuming there is a GIN trigram index on winery field.

The limit for similarity (I know this is deprecated):

SELECT set_limit(0.5);

I'm building a trigram search on 2,3M row table.

My select code:

Click to copy

SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity 
FROM usr_wines 
WHERE status=1 AND winery % 'chateau chevla blanc'  
ORDER BY similarity DESC;

My results (329 ms on my mac):

Click to copy

Chateau ChevL Blanc 0,85
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc,  0,736842
Chateau Blanc   0,736842
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc (7)    0,666667
Chateau Cheval Blanc Cbo    0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64

Well, I don't understand how can "Chateau blanc" have a similarity > to "Chateau Cheval Blanc" in this case ? I understand that the 2 words are the exact same "chateau" and "blanc", but there is no other word "cheval".

Also why "Chateau ChevL Blanc" is first ? A letter "a" is missing !

Well, my goal is to match all possible duplicates when I give a winery name, even if it's mispelled. What did I miss ?

571

asked Apr 01 '17 12:04

alex.bour

1 Answers

The concept of trigram similarity relies on having any sentence divided into "trigrams" (sequences of three consecutive letters), and treating the result as a SET (i.e.: the order doesn't matter, and you don't have repeated values). Before the sentence is considered, two blank spaces are added at the beginning, and one at the end, and single spaces are replaced by double ones.

Trigrams are a special case of N-grams.

The trigram set corresponding to "Chateau blanc" is found by finding all sequences of three letters that appear on it:

Click to copy

  chateau  blanc
---                 => '  c'
 ---                => ' ch'
  ---               => 'cha'
   ---              => 'hat'
    ---             => 'ate'
     ---            => 'tea'
      ---           => 'eau'
       ---          => 'au '
        ---         => 'u  '
         ---        => '  b'
          ---       => ' bl'
           ---      => 'bla'
            ---     => 'lan'
             ---    => 'anc'
              ---   => 'nc '

Sorting them, and taking out repetitions gets you:

Click to copy

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'

This can be computed by PostgreSQL by means of the function show_trgm:

Click to copy

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

... which has 14 trigrams. (Check pg_trgm).

And the trigram set corresponding to "Chateau Cheval Blanc" is:

Click to copy

SELECT show_trgm('Chateau Cheval Blanc') AS B 

B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

... which has 19 trigrams

If you count how many trigrams have both sets in common, you find that they have the following ones:

Click to copy

A intersect B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

and the ones they have in total are:

Click to copy

A union B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

That is, both sentences have 14 trigrams in common, and 19 in total.
The similarity is computed as:

Click to copy

 similarity = 14 / 19

You can check it with:

Click to copy

SELECT 
    cast(14.0/19.0 as real) AS computed_result, 
    similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg

and you'll see that you get: 0.736842

... which explains how similarity is computed, and why you get the values you get.

NOTE: You can compute the intersection and union by means of:

Click to copy

SELECT 
   array_agg(t) AS in_common
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    INTERSECT 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ORDER BY t
) AS trigrams_in_common ;

SELECT 
   array_agg(t) AS in_total
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    UNION 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;

And this is a way to explore the similarity of different pair of sentences:

Click to copy

WITH p AS
(
    SELECT 
      'This is just a sentence I''ve invented'::text AS f1,
      'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM 
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM 
            (SELECT * FROM t1 UNION     SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
FROM
    p 
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can check it at Rextester

157

answered Nov 16 '22 02:11

joanolo

Related questions
                            
                                Update Returning Order by in postgresql
                            
                                Debugging a program that uses SIGINT with gdb
                            
                                PostgreSQL pg_dump [closed]
                            
                                How to handle Play Framework 2 database evolutions in production
                            
                                PostgreSQL DELETE FROM (SELECT * FROM table FETCH FIRST 10 ROWS ONLY)
                            
                                "pg_dump: invalid option -- i" when migrating
                            
                                Importing zipped CSV file into PostgreSQL
                            
                                Resolve "Binaries not found" error in Postgres.app
                            
                                In Rails, Couldn't create database for {"adapter"=>"postgresql",
                            
                                Psycopg2 peer authentication for user postgres
                            
                                Logging slow queries on Google Cloud SQL PostgreSQL instances
                            
                                Doctrine - Add default time stamp to entity like NOW()
                            
                                How to use Lowercase function in Sequelize Postgres
                            
                                Register and run PostgreSQL 9.0 as Windows Service
                            
                                Performance issue in update query
                            
                                Why Planing time and Execution time are so different Postgres?
                            
                                hibernate with c3p0: createClob() is not yet implemented
                            
                                Sequelize configuration to retrieve total count with details
                            
                                Cloud9 postgres
                            
                                Unable to start postgresql.service?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

PostgreSQL, trigrams and similarity

Tags:

postgresql

similarity

trigram

alex.bour

People also ask

1 Answers

joanolo

Recent Activity

Donate For Us