Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested, and I would like them to come back as hits for queries like

select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')

Unfortunately, the default parser treats radio/tested as a file token (even though I'm in a Windows environment), so it doesn't match the above query. Running ts_debug on it shows that it is recognized as a file, and the resulting lexeme is radio/tested rather than the two lexemes radio and test.
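
The check described above can be reproduced along these lines. This is only a sketch: the three selected columns are a subset of what ts_debug returns, and the sample string is the one from the question.

```sql
-- Show how the parser of the 'english' configuration classifies
-- each token of the sample text, and which lexemes come out of it.
SELECT alias, token, lexemes
FROM ts_debug('english', 'radio/tested');
```

The alias column is where the file token type shows up.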

Is there any way to configure the parser not to look for file tokens? I tried

ALTER TEXT SEARCH CONFIGURATION public.english
    DROP MAPPING FOR file;

...but it didn't change the output of ts_debug. It would be really helpful if there were some way to disable file, or at least to have the parser emit both the file token and all the words that make up the directory names along the way, or to have it treat slashes as hyphens or spaces (without the performance hit of running regexp_replace over the text myself).

Kev asked Dec 30 '09 14:12

1 Answer

I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file and remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext, etc.); that should be all the parsing logic needs. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
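
Once such a module is compiled, it still has to be registered on the SQL side. The following is a rough sketch modelled on what contrib/test_parser does, not a tested recipe. Every name in it (the nofile_parser library, the nofile_* functions, the english_nofile configuration) is hypothetical, and it assumes the copied parser keeps the default parser's token type names such as asciiword and word.

```sql
-- Hypothetical C entry points exported by the modified copy of wparser_def.c.
CREATE FUNCTION nofile_start(internal, int4) RETURNS internal
    AS '$libdir/nofile_parser', 'nofile_start' LANGUAGE C STRICT;
CREATE FUNCTION nofile_nexttoken(internal, internal, internal) RETURNS internal
    AS '$libdir/nofile_parser', 'nofile_nexttoken' LANGUAGE C STRICT;
CREATE FUNCTION nofile_end(internal) RETURNS void
    AS '$libdir/nofile_parser', 'nofile_end' LANGUAGE C STRICT;
CREATE FUNCTION nofile_lextype(internal) RETURNS internal
    AS '$libdir/nofile_parser', 'nofile_lextype' LANGUAGE C STRICT;

-- Register the parser; the headline function is borrowed from the
-- default parser, as contrib/test_parser does.
CREATE TEXT SEARCH PARSER nofile_parser (
    START    = nofile_start,
    GETTOKEN = nofile_nexttoken,
    END      = nofile_end,
    HEADLINE = pg_catalog.prsd_headline,
    LEXTYPES = nofile_lextype
);

-- A configuration created from a parser starts with no mappings,
-- so the token types of interest must be mapped explicitly.
CREATE TEXT SEARCH CONFIGURATION english_nofile (PARSER = nofile_parser);
ALTER TEXT SEARCH CONFIGURATION english_nofile
    ADD MAPPING FOR asciiword, word WITH english_stem;

-- Check how the new parser tokenizes the sample text.
SELECT alias, token, lexemes
FROM ts_debug('english_nofile', 'radio/tested');
```

Only two token types are mapped here to keep the sketch short; a real configuration would probably map the remaining types (numbers, hyphenated words and so on) the way the built-in english configuration does.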

alvherre answered Sep 30 '22 17:09