Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres full text search and spelling mistakes (aka fuzzy full text search)

I have a scenario, where I have data for informal communications that I need to be able to search. Therefore I want full text search, but I also to make sense of spelling mistakes. Question is how do I take spelling mistakes into account in order to be able to do fuzzy full text search??

This is very briefly discussed in Postgres Full Text Search is Good Enough where the article discusses misspelling.

So I have built a table of "documents", created indexes etc.

CREATE TABLE data (
  id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, 
  text TEXT NOT NULL);

I can create an additional column of type tsvector and index accordingly...

alter table data 
  add column search_index tsvector 
  generated always as (to_tsvector('english', coalesce(text, ''))) 
  STORED;

create index search_index_idx on data using gin (search_index);

I have for example, some text where the data says "baloon", but someone may search "balloon", so I insert two rows (one deliberately misspelled)...

insert into data (text) values ('baloon');
insert into data (text) values ('balloon');

select * from data;

id |  text   | search_index
----+---------+--------------
 1 | baloon  | 'baloon':1
 2 | balloon | 'balloon':1

... and perform full text searches against the data...

select * from data where search_index @@ plainto_tsquery('balloon');

 id |  text   | search_index
----+---------+--------------
  2 | balloon | 'balloon':1
(1 row)

But I don't get back results for the misspelled version "baloon"... So using the suggestion in the linked article I've built a lookup table of all the words in my lexicon as follows...

"you may obtain good results by appending the similar lexeme to your tsquery"

CREATE TABLE data_words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', text) FROM data');

CREATE INDEX data_words_idx ON data_words USING GIN (word gin_trgm_ops);

... and I can search for similar words which may have been misspelled

select word, similarity(word, 'balloon') as similarity from data_words where similarity(word, 'balloon') > 0.4 order by similarity(word, 'balloon');

  word   | similarity
---------+------------
 baloon  |  0.6666667
 balloon |          1

... but how do I actually include misspelled words in my query?

Isn't this what the article above means?

select plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));

         plainto_tsquery
----------------------------------
 'balloon' & 'baloon' & 'balloon'
(1 row)

... plugged into an actual search, and I get no rows!

select * from data where text @@ plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));

select * from data where search_index @@ phraseto_tsquery('baloon balloon'); -- no rows returned

I'm not sure where I'm going wrong here - can any shed any light? I feel like I'm super close to getting this going...?

like image 652
0909EM Avatar asked Jan 21 '20 15:01

0909EM


1 Answers

SELECT to_tsquery('balloon |' ||
                  string_agg(word, ' | ')
       )
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;
like image 82
Laurenz Albe Avatar answered Nov 07 '22 14:11

Laurenz Albe