Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Full text search index on a multilingual column

Tags:

sql

postgresql

I have a DB in which there is a table shows with a multilingual column title. I would like to optimise full text search by adding an index like so:

CREATE INDEX title_idx ON shows USING gin(to_tsvector(title));

I get this error:

ERROR:  functions in index expression must be marked IMMUTABLE

It basically asks me to add the language parameter to make to_tsvector immutable. the result would be:

 CREATE INDEX title_idx ON shows USING gin(to_tsvector(LANGUAGE, title));

where LANGUAGE would be one of my target languages.

Is it possible to create an index that works for several languages ?

like image 219
Francis.Beauchamp Avatar asked Jan 22 '14 16:01

Francis.Beauchamp


3 Answers

Is it possible to create an index that works for several languages ?

Yes, but you need a second column that identifies the language of the text. Say you added a column doc_language to the table; you could then write:

CREATE INDEX title_idx ON shows USING gin(to_tsvector(doc_language, title));

Of course, this requires that you know the language of the subject text, something that can be hard to do in practice. If you don't need stemming, etc, you can just use the language simple, but I'm guessing you would've done that already if it were an option.

As an alternative, if you have a fixed and limited set of languages, you can concatenate the vectors for the different languages. E.g.:

regress=> SELECT to_tsvector('english', 'cafés') || to_tsvector('french', 'cafés') || to_tsvector('simple', 'cafés');
          ?column?          
----------------------------
 'caf':2 'café':1 'cafés':3
(1 row)

That'll match a tsquery for cafés in any of those three languages.

As an index:

CREATE INDEX title_idx ON shows USING gin((
    to_tsvector('english', title) || 
    to_tsvector('french', title) || 
    to_tsvector('simple', title)
));

but this is clumsy to use in queries, as the planner isn't very smart about matching index quals. So I'd wrap it in a function:

CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS $$
SELECT to_tsvector('english', $1) || 
       to_tsvector('french', $1) || 
       to_tsvector('simple', $1)
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX title_idx ON shows USING gin(to_tsvector_multilang(title));

If you want you can even get fancy: pass the list of languages as an array (but remember it'll have to be exactly the same order for an index qual match to work). Use priorities with setweight, so you prefer a match in English to one in French, say. All sorts of options.

like image 59
Craig Ringer Avatar answered Oct 29 '22 08:10

Craig Ringer


I've just made a Postgres function to test the text language. It's not perfect but it works for long texts.

CREATE OR REPLACE FUNCTION get_language(t text) RETURNS regconfig AS $$
DECLARE
    ret regconfig;
    BEGIN
            WITH l as ( SELECT cfgname, to_tsvector(cfgname::regconfig, title) as vector, length(to_tsvector(cfgname::regconfig, title)) as len
        FROM pg_ts_config, (select t as title) as ti)
    SELECT cfgname::regconfig
    INTO ret 
    FROM l
    WHERE len=(SELECT MIN(len) FROM l)
    ORDER BY cfgname='simple' DESC, cfgname ASC
    LIMIT 1;
    RETURN ret;
    END;
$$ LANGUAGE plpgsql;

It just look for the shortest tsvector for the given text (so it tries every ts config of postgres).

like image 4
Lenra Avatar answered Oct 29 '22 09:10

Lenra


In my case, I know the language of the current ligne. I have a "language" column. But if I do :

CREATE INDEX filmtraduction_tsindex ON filmtraduction USING GIN (
to_tsvector((case when $1=''language.fr'' then ''french'' when $1=''language.es'' then ''spanish'' else ''english'' end)::regconfig, body));

If have "functions in index expression must be marked IMMUTABLE".

So I need to create an immutable function:

 CREATE FUNCTION toregconfig(text) RETURNS regconfig AS 'select (case
 when $1=''language.fr'' then ''french'' when $1=''language.es'' then
 ''spanish'' else ''english'' end)::regconfig;' LANGUAGE SQL IMMUTABLE
 RETURNS NULL ON NULL INPUT;

Then I can create my index:

CREATE INDEX filmtraduction_tsindex ON filmtraduction USING GIN (
to_tsvector(toregconfig(language), body));

I do the query like this:

select * from movietraduction where to_tsvector(toregconfig(language), body) @@ plainto_tsquery('foo');
like image 2
Adrien Avatar answered Oct 29 '22 07:10

Adrien