I'm creating a search inside a Rails app using the pg_search gem
. However, one of the tables have a Text
datatype field that it's content happens to be a little larger than usual.
Now when I need to setup a tsvector column
for the text
columns, I face some limitations that due the the text field size vs tsvector size.
ERROR: string is too long for tsvector (5068741 bytes, max 1048575 bytes)
Is there any way that I determine condition to skip bigger Text
fields while creating the tsvector column in the SQL
trigger to do something like this:
pseudocode:
execute(<<-TRIGGERSQL)
CREATE OR REPLACE FUNCTION public.essays_before_insert_update_row_tr()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
BEGIN
If (SELECT LEN(body_text) FROM essays) <= 1048575
new.tsv_body_text := to_tsvector('pg_catalog.english', coalesce(new.body_text,''));
RETURN NEW;
End
END;
$function$
TRIGGERSQL
# no candidate create_trigger statement could be found, creating an adapter-specific one
execute("CREATE TRIGGER essays_before_insert_update_row_tr BEFORE INSERT OR UPDATE ON \"essays\" FOR EACH ROW EXECUTE PROCEDURE essays_before_insert_update_row_tr()")
related question that I found without an answer:
Postgresql - converting text to ts_vector
In PostgreSQL, you use two functions to perform Full Text Search. They are to_tsvector() and to_tsquery(). Let's see how they work and to use them first. to_tsvector() function breaks up the input string and creates tokens out of it, which is then used to perform Full Text Search using the to_tsquery() function.
A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details).
Understanding Postgres Full Text Search: 10 Critical Aspects. Sarad on Database, PostgreSQL, Tutorial • August 28th, 2017 • Write for Hevo. Setting up PostgreSQL Full Text Search (FTS) helps users bring up a fully-fledged search engine without a steep learning curve and new software to install.
A simple workaround is to just invoke to_tsvector()
with a truncated text value. For example, using the trigger example from the Postgres manual as starting point this approach looks like this:
CREATE FUNCTION essays_tsv_trigger_fn() RETURNS trigger AS $$
begin
new.tsv_body_text := to_tsvector('english', left(new.body_text, 4*1024*1024));
return new;
end
$$ LANGUAGE plpgsql;
CREATE TRIGGER essays_tsv_trigger BEFORE INSERT OR UPDATE
ON essays FOR EACH ROW EXECUTE FUNCTION essays_tsv_trigger_fn();
This truncates the document's content to 4 MiB which should be useful enough for many document collections. Instead of just ignoring 'overly' long documents you include at least parts of it. In my experience, 4 MiB works well for technical english documentation. Depending on the size of the actually used vocabulary you could even succeed when truncating with a larger value like 10 MiB.
If you really want to ignore too long documents you could guard the to_tsvector()
assignment with an if statement like this:
CREATE FUNCTION essays_tsv_trigger_fn() RETURNS trigger AS $$
begin
if length(new.body_text) <= 4*1024*1024 then
new.tsv_body_text := to_tsvector('english', new.body_text);
end if;
return new;
end
$$ LANGUAGE plpgsql;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With