I have a DB in which there is a table shows
with a multilingual column title
. I would like to optimise full text search by adding an index like so:
CREATE INDEX title_idx ON shows USING gin(to_tsvector(title));
I get this error:
ERROR: functions in index expression must be marked IMMUTABLE
It basically asks me to add the language parameter to make to_tsvector immutable. the result would be:
CREATE INDEX title_idx ON shows USING gin(to_tsvector(LANGUAGE, title));
where LANGUAGE
would be one of my target languages.
Is it possible to create an index that works for several languages ?
Is it possible to create an index that works for several languages ?
Yes, but you need a second column that identifies the language of the text. Say you added a column doc_language
to the table; you could then write:
CREATE INDEX title_idx ON shows USING gin(to_tsvector(doc_language, title));
Of course, this requires that you know the language of the subject text, something that can be hard to do in practice. If you don't need stemming, etc, you can just use the language simple
, but I'm guessing you would've done that already if it were an option.
As an alternative, if you have a fixed and limited set of languages, you can concatenate the vectors for the different languages. E.g.:
regress=> SELECT to_tsvector('english', 'cafés') || to_tsvector('french', 'cafés') || to_tsvector('simple', 'cafés');
?column?
----------------------------
'caf':2 'café':1 'cafés':3
(1 row)
That'll match a tsquery for cafés
in any of those three languages.
As an index:
CREATE INDEX title_idx ON shows USING gin((
to_tsvector('english', title) ||
to_tsvector('french', title) ||
to_tsvector('simple', title)
));
but this is clumsy to use in queries, as the planner isn't very smart about matching index quals. So I'd wrap it in a function:
CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS $$
SELECT to_tsvector('english', $1) ||
to_tsvector('french', $1) ||
to_tsvector('simple', $1)
$$ LANGUAGE sql IMMUTABLE;
CREATE INDEX title_idx ON shows USING gin(to_tsvector_multilang(title));
If you want you can even get fancy: pass the list of languages as an array (but remember it'll have to be exactly the same order for an index qual match to work). Use priorities with setweight
, so you prefer a match in English to one in French, say. All sorts of options.
I've just made a Postgres function to test the text language. It's not perfect but it works for long texts.
CREATE OR REPLACE FUNCTION get_language(t text) RETURNS regconfig AS $$
DECLARE
ret regconfig;
BEGIN
WITH l as ( SELECT cfgname, to_tsvector(cfgname::regconfig, title) as vector, length(to_tsvector(cfgname::regconfig, title)) as len
FROM pg_ts_config, (select t as title) as ti)
SELECT cfgname::regconfig
INTO ret
FROM l
WHERE len=(SELECT MIN(len) FROM l)
ORDER BY cfgname='simple' DESC, cfgname ASC
LIMIT 1;
RETURN ret;
END;
$$ LANGUAGE plpgsql;
It just look for the shortest tsvector for the given text (so it tries every ts config of postgres).
In my case, I know the language of the current ligne. I have a "language" column. But if I do :
CREATE INDEX filmtraduction_tsindex ON filmtraduction USING GIN (
to_tsvector((case when $1=''language.fr'' then ''french'' when $1=''language.es'' then ''spanish'' else ''english'' end)::regconfig, body));
If have "functions in index expression must be marked IMMUTABLE".
So I need to create an immutable function:
CREATE FUNCTION toregconfig(text) RETURNS regconfig AS 'select (case
when $1=''language.fr'' then ''french'' when $1=''language.es'' then
''spanish'' else ''english'' end)::regconfig;' LANGUAGE SQL IMMUTABLE
RETURNS NULL ON NULL INPUT;
Then I can create my index:
CREATE INDEX filmtraduction_tsindex ON filmtraduction USING GIN (
to_tsvector(toregconfig(language), body));
I do the query like this:
select * from movietraduction where to_tsvector(toregconfig(language), body) @@ plainto_tsquery('foo');
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With