Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL tsvector configuration: how to allow special characters?

Tags:

sql

postgresql

I've set up a new text search configuration in PostgreSQL. This configuration uses an empty stop word file, DictFile and AffFile.

A simple test...

SELECT * 
  FROM ts_debug('public.myconfig', 'C++ and C# and PHP');

...works (nearly) fine except the fact that the lexemes for 'C++' and 'C#' are both 'C'. Basically, all I would like to do is to make sure that the lexeme for 'C++' is 'C++' and that the lexeme for 'C#' is 'C#', hence enabling the user to query for 'C++'.

like image 295
itsame69 Avatar asked May 29 '11 18:05

itsame69


People also ask

How do you handle special characters in PostgreSQL?

Special character symbols are characters with a pre-defined syntactic meaning in PostgreSQL. They are typically disallowed from being used in identifier names for this reason, though as mentioned in the section on quoted identifiers, this restriction can usually be worked around with quotes if need be.

What does to_ tsvector do?

The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type.

What is FTS configuration in PostgreSQL?

A FTS configuration specifies all of the equipment necessary to transform a document into a tsvector: the parser that breaks its text into tokens, and the dictionaries, which then transform each token into a lexeme. Every call to to_tsvector() , to_tsquery() uses a configuration to perform its processing.

What PostgreSQL operator checks to see if a Tsquery matches a Tsvector?

@@ operator checks if tsquery matches tsvector. For instance, if the word to be queried is “fox” for the above-mentioned example, then: SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');


1 Answers

The problem is two-fold.

  1. The first issue is with indexing. The default config uses stemming to index the data and hence you get the lexeme 'C' for 'C++' and 'C#'. You can use the 'simple' config to index the data but then you will end up with unwanted words in the index. What I did was convert 'C++' to 'Cplusplus' and index that. Since 'Cplusplus' has no lexeme, it will be inxexed as is. You can do the same for 'C#' by making it 'CSharp' in the index data.

  2. The second issue is with searching. Now that the indexed data is okay, we need to make sure that we convert the search terms for the special words. If the user types in 'C++", you need to convert it to 'Cplusplus' before performing the search.

I created a function on the DB that took in a string and converted all occurences of the special terms.

I have not used TSearch in a while, so I am not sure if TSearch allows you to setup and exception list the same way it allows for stopwords.

like image 194
Imraan Avatar answered Sep 29 '22 06:09

Imraan