Which record is causing NOTICE: word is too long to be indexed

Using Postgres in a Rails app (with pg_search gem), I have enabled search with tsvector. In a database with over 35,000 records I get several messages saying

NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.

Am I correct in assuming a "word" does not include whitespace? How can I determine which records are causing this message?

Here's the SQL generated by the migration which introduces the indexes

 ==  AddIndexForFullTextSearch: migrating ======================================
-- add_column(:posts, :tsv, :tsvector)
   -> 0.0344s
-- execute("      CREATE INDEX index_posts_tsv ON posts USING gin(tsv);\n")
   -> 0.1694s
-- execute("    UPDATE posts SET tsv = (to_tsvector('english', coalesce(title, '')) || \n                            to_tsvector('english', coalesce(intro, '')) || \n                            to_tsvector('english', coalesce(body, '')));\n")
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
   -> 343.0556s
-- execute("      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE\n      ON posts FOR EACH ROW EXECUTE PROCEDURE\n      tsvector_update_trigger(tsv, 'pg_catalog.english', title, intro, body);\n")
   -> 0.0266s
asked by linojon

2 Answers

Postgres full text search limits the maximum token length to 2047 bytes; the documentation states:

The length of each lexeme must be less than 2K bytes

If the text parser in use returns longer tokens, you get NOTICE messages like the ones quoted in your question, e.g.:

select to_tsvector('english', repeat('x', 2*1024));
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.

In that context, 'word' really means token, and the default Postgres text parser may return whitespace tokens that hit this limit.

You can identify records that produce this notice with a query that just invokes the text parser and selects all the overly long tokens. For example:

-- example schema: a table "content" with an id and a text column "payload"
select t.id, tt.tokid, tt.alias, length(t.token), t.token from (
    select id, (ts_parse('default', payload)).* from content) t
    inner join ts_token_type('default') tt
       on t.tokid = tt.tokid
    where length(t.token) >= 2*1024;
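
Adapted to the question's schema, a sketch of the same query against the posts table (assuming the same title, intro and body columns that feed the tsvector) could look like this:

-- sketch: find posts whose searchable text yields tokens over the 2047 byte limit
select t.id, tt.alias, length(t.token)
from (
    select id,
           (ts_parse('default',
                     coalesce(title, '') || ' ' ||
                     coalesce(intro, '') || ' ' ||
                     coalesce(body, ''))).*
    from posts) t
    inner join ts_token_type('default') tt
        on t.tokid = tt.tokid
    where length(t.token) >= 2*1024;

The returned id tells you which record triggered the notice.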

If it isn't obvious why the parser yielded long tokens, you can look at the preceding and following tokens of an example record, like this:

select case when length(token)>128 then '###' else '' end,
       t.tokid, tt.alias, length(token), token
       from ts_parse('default',
                     (select payload from content where id = 'foobar')) t
    inner join ts_token_type('default') tt
        on t.tokid = tt.tokid;

You can search for ### in the output and look at the surrounding tokens to get an idea of how the parse went wrong.

For example, the default Postgres text parser recognizes HTML/XML style tags and returns them as separate tag tokens. As of Postgres 11, if it sees an opening 'tag' without a matching closing tag, it sometimes returns some of the following text as an artificial whitespace token. For example:

select case when length(token)>128 then '###' else '' end, t.tokid, tt.alias,
    length(token), token from ts_parse('default', (select $$<script> 

           We should forget about small efficiencies, say about 97% of the time

       <script>

           premature optimization is the root of all evil.

$$)) t inner join ts_token_type('default') tt on t.tokid = tt.tokid;

This is parsed as 4 tokens, where the blank tokens even contain some text:

case tokid alias length
──── ───── ───── ──────
        13 tag        8
        12 blank     90                           
        13 tag        8
        12 blank     62
(4 rows)

(last column omitted for brevity)

If there are real paragraphs between such pseudo 'tags', the 2k limit is easily reached by these pseudo whitespace/blank tokens.

The quick fix for this is to remove <> characters from the text argument of to_tsvector()/ts_parse() so that the default Postgres parser doesn't misidentify <>-enclosed words as tags, e.g.:

... regexp_replace($$...$$, '[<>]', '', 'g') ...
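
Applied to the UPDATE from the question's migration, a sketch of this fix (assuming the same posts columns) might look like:

UPDATE posts SET tsv = to_tsvector('english', regexp_replace(coalesce(title, ''), '[<>]', '', 'g')) ||
                       to_tsvector('english', regexp_replace(coalesce(intro, ''), '[<>]', '', 'g')) ||
                       to_tsvector('english', regexp_replace(coalesce(body, ''), '[<>]', '', 'g'));

Note that the tsvector_update_trigger() from the migration only concatenates plain column values, so applying the same replacement on INSERT/UPDATE would require a custom trigger function instead.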

Unfortunately, features of the default Postgres text parser (such as tag detection) aren't parameterizable (as of version 11). One can use a custom parser, but creating one currently basically means writing a new parser in C and loading it as an extra extension - which arguably is tedious and error-prone.

answered by maxschlepzig

According to the PostgreSQL documentation, “full text search functionality includes the ability to […] parse based on more than just white space”, depending on your “text search configurations”. So you’ll have to examine your configuration to find out what a “word” means.
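
A quick way to check which configuration is in effect, and how a given piece of text is split into tokens, is the following sketch (the sample string is just a placeholder):

-- show the text search configuration currently in effect
SHOW default_text_search_config;

-- inspect how a sample string is tokenized under the 'english'
-- configuration used in the question's migration
SELECT alias, token, lexemes
FROM ts_debug('english', 'sample text to inspect');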

You could search for long whitespace-separated words using a regular expression:

SELECT regexp_matches(the_text_col, '\S{2048,}') FROM the_table

That regex searches for 2048 or more consecutive non-whitespace characters, i.e. words longer than 2047 characters.
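
To also see which rows are affected, a sketch against the question's posts table (assuming the title, intro and body columns) could be:

SELECT id,
       regexp_matches(coalesce(title, '') || ' ' ||
                      coalesce(intro, '') || ' ' ||
                      coalesce(body, ''), '\S{2048,}', 'g')
FROM posts;

Note that this only finds long whitespace-delimited words; as the other answer shows, the parser itself can also produce overly long tokens that contain whitespace.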

answered by Rory O'Kane