Just testing PostgreSQL 9.6.2 on my Mac and playing with Ngrams. Assuming there is a GIN trigram index on winery field.
The limit for similarity (I know this is deprecated):
SELECT set_limit(0.5);
I'm building a trigram search on 2,3M row table.
My select code:
SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity
FROM usr_wines
WHERE status=1 AND winery % 'chateau chevla blanc'
ORDER BY similarity DESC;
My results (329 ms on my mac):
Chateau ChevL Blanc 0,85
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc, 0,736842
Chateau Blanc 0,736842
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc (7) 0,666667
Chateau Cheval Blanc Cbo 0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64
Well, I don't understand how can "Chateau blanc" have a similarity > to "Chateau Cheval Blanc" in this case ? I understand that the 2 words are the exact same "chateau" and "blanc", but there is no other word "cheval".
Also why "Chateau ChevL Blanc" is first ? A letter "a" is missing !
Well, my goal is to match all possible duplicates when I give a winery name, even if it's mispelled. What did I miss ?
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.
As explained in the Indexes page, the eXtremeDB , Trigram ( trigram ) indexes are ideal for text searches when the exact spelling of the target object is not precisely known. It finds objects which match the maximum number of three-character strings in the entered search terms, i.e., near matches.
GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.
Trigram Phrase Matching is a method of identifying phrases that have a high probability of being synonyms. It is based on representing each phrase by a set of character trigrams that are extracted from that phrase.
The concept of trigram similarity relies on having any sentence divided into "trigrams" (sequences of three consecutive letters), and treating the result as a SET (i.e.: the order doesn't matter, and you don't have repeated values). Before the sentence is considered, two blank spaces are added at the beginning, and one at the end, and single spaces are replaced by double ones.
Trigrams are a special case of N-grams.
The trigram set corresponding to "Chateau blanc" is found by finding all sequences of three letters that appear on it:
chateau blanc
--- => ' c'
--- => ' ch'
--- => 'cha'
--- => 'hat'
--- => 'ate'
--- => 'tea'
--- => 'eau'
--- => 'au '
--- => 'u '
--- => ' b'
--- => ' bl'
--- => 'bla'
--- => 'lan'
--- => 'anc'
--- => 'nc '
Sorting them, and taking out repetitions gets you:
' b'
' c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
This can be computed by PostgreSQL by means of the function show_trgm
:
SELECT show_trgm('Chateau blanc') AS A
A = [ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
... which has 14 trigrams. (Check pg_trgm).
And the trigram set corresponding to "Chateau Cheval Blanc" is:
SELECT show_trgm('Chateau Cheval Blanc') AS B
B = [ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
... which has 19 trigrams
If you count how many trigrams have both sets in common, you find that they have the following ones:
A intersect B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
and the ones they have in total are:
A union B =
[ b, c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
That is, both sentences have 14 trigrams in common, and 19 in total.
The similarity is computed as:
similarity = 14 / 19
You can check it with:
SELECT
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg
and you'll see that you get: 0.736842
... which explains how similarity is computed, and why you get the values you get.
NOTE: You can compute the intersection and union by means of:
SELECT
array_agg(t) AS in_common
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
INTERSECT
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
ORDER BY t
) AS trigrams_in_common ;
SELECT
array_agg(t) AS in_total
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
UNION
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;
And this is a way to explore the similarity of different pair of sentences:
WITH p AS
(
SELECT
'This is just a sentence I''ve invented'::text AS f1,
'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
SELECT
(SELECT count(*) FROM
(SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
(SELECT count(*) FROM
(SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
similarity(f1, f2) AS sim_2
FROM
p
)
SELECT
same, total, same::real/total::real AS sim_1, sim_2
FROM
x ;
You can check it at Rextester
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With