I have some documents that contain sequences such as radio/tested, and I would like them to be returned as hits by queries like
select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, I can see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
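For reference, the check described above can be reproduced with a query along these lines. It is a minimal sketch that assumes the stock english configuration; the selected columns are a subset of what ts_debug returns.

```sql
-- Show how the default parser tokenizes the problem string:
-- alias is the token type, lexemes is what ends up in the tsvector.
SELECT alias, token, lexemes
FROM ts_debug('english', 'radio/tested');
```

With the default parser, the whole string comes back as a single token of type file rather than as separate word tokens.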
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replace-ing them myself), that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext, etc.), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
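Once the modified parser is compiled, it still has to be registered on the SQL side. The following is a rough sketch modelled on the test_parser example. The library name noslash_parser, the noslash_* function names, and the configuration name noslash_english are all hypothetical placeholders for whatever your module actually exports. The token type names in the mapping assume the copied parser keeps the default parser's names.

```sql
-- Hypothetical names: '$libdir/noslash_parser' and the noslash_* symbols
-- stand in for the shared library built from the modified wparser_def.c.
CREATE FUNCTION noslash_start(internal, int4) RETURNS internal
    AS '$libdir/noslash_parser', 'noslash_start' LANGUAGE C STRICT;

CREATE FUNCTION noslash_nexttoken(internal, internal, internal) RETURNS internal
    AS '$libdir/noslash_parser', 'noslash_nexttoken' LANGUAGE C STRICT;

CREATE FUNCTION noslash_end(internal) RETURNS void
    AS '$libdir/noslash_parser', 'noslash_end' LANGUAGE C STRICT;

CREATE FUNCTION noslash_lextype(internal) RETURNS internal
    AS '$libdir/noslash_parser', 'noslash_lextype' LANGUAGE C STRICT;

-- Register the parser itself.
CREATE TEXT SEARCH PARSER noslash_parser (
    START    = noslash_start,
    GETTOKEN = noslash_nexttoken,
    END      = noslash_end,
    LEXTYPES = noslash_lextype
);

-- A configuration built on the new parser starts with no mappings,
-- so the token types of interest have to be mapped to a dictionary.
CREATE TEXT SEARCH CONFIGURATION noslash_english (PARSER = noslash_parser);

ALTER TEXT SEARCH CONFIGURATION noslash_english
    ADD MAPPING FOR asciiword, word WITH english_stem;
```

Queries would then name the new configuration in place of 'english', for example to_tsvector('noslash_english', body) @@ to_tsquery('noslash_english', 'radio').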