I've set up a new text search configuration in PostgreSQL. This configuration uses an empty stop word file, DictFile and AffFile.
A simple test...
SELECT *
FROM ts_debug('public.myconfig', 'C++ and C# and PHP');
...works (nearly) fine except the fact that the lexemes for 'C++' and 'C#' are both 'C'. Basically, all I would like to do is to make sure that the lexeme for 'C++' is 'C++' and that the lexeme for 'C#' is 'C#', hence enabling the user to query for 'C++'.
Special character symbols are characters with a pre-defined syntactic meaning in PostgreSQL. They are typically disallowed from being used in identifier names for this reason, though as mentioned in the section on quoted identifiers, this restriction can usually be worked around with quotes if need be.
The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type.
A FTS configuration specifies all of the equipment necessary to transform a document into a tsvector: the parser that breaks its text into tokens, and the dictionaries, which then transform each token into a lexeme. Every call to to_tsvector() , to_tsquery() uses a configuration to perform its processing.
@@ operator checks if tsquery matches tsvector. For instance, if the word to be queried is “fox” for the above-mentioned example, then: SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
The problem is two-fold.
The first issue is with indexing. The default config uses stemming to index the data and hence you get the lexeme 'C' for 'C++' and 'C#'. You can use the 'simple' config to index the data but then you will end up with unwanted words in the index. What I did was convert 'C++' to 'Cplusplus' and index that. Since 'Cplusplus' has no lexeme, it will be inxexed as is. You can do the same for 'C#' by making it 'CSharp' in the index data.
The second issue is with searching. Now that the indexed data is okay, we need to make sure that we convert the search terms for the special words. If the user types in 'C++", you need to convert it to 'Cplusplus' before performing the search.
I created a function on the DB that took in a string and converted all occurences of the special terms.
I have not used TSearch in a while, so I am not sure if TSearch allows you to setup and exception list the same way it allows for stopwords.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With