
Matching special characters (e.g. #, +) using pg_search

I'm using the pg_search gem in a Rails app to search against users - their bios and associated skill model. Users are developers, so their skills include things like "CSS", "C++", "C#", "Objective C", etc...

I was initially using the following search scope:

pg_search_scope :search,
  against: [:bio],
  using: {tsearch: {dictionary: "english", prefix: true}},
  associated_against: {user: [:fname, :lname], skills: :name}

However, if you search "C++" in this case, you'd get results that included "CSS" (among other things). I changed the scope to use the "simple" dictionary and removed prefixing:

pg_search_scope :search_without_prefix,
  against: [:bio],
  using: {tsearch: {dictionary: "simple"}}, 
  associated_against: {user: [:fname, :lname], skills: :name}

This fixed some things: for example, searching "C++" no longer shows "CSS". But searching "C++" or "C#" still matches users who have "C" or "Objective C" listed.

I can definitely do a basic ILIKE match, but I'm hoping to accomplish this using pg_search if possible.

Lev asked Oct 15 '13 21:10


1 Answer

I would comment but I don't have sufficient reputation yet.

I have been studying pg_search, which has led me deeper into PostgreSQL Full Text Search. It's a complex module, but it has the ts_debug() function to help understand how input strings are parsed. The ts_debug() output for the test string "C++ CSS C# Objective C" is very revealing. It looks like "#" and "+" are treated as white space by the default parser, which the "english" configuration uses. The "simple" configuration uses the same parser, which would explain why switching dictionaries did not help. I think you might have to change the parser in PostgreSQL to get the behavior you want.

postgres=# SELECT * FROM ts_debug('english', 'C++ CSS C# Objective C');
   alias   |   description   |   token   |  dictionaries  |  dictionary  | lexemes  
-----------+-----------------+-----------+----------------+--------------+----------
 asciiword | Word, all ASCII | C         | {english_stem} | english_stem | {c}
 blank     | Space symbols   | +         | {}             |              | 
 blank     | Space symbols   | +         | {}             |              | 
 asciiword | Word, all ASCII | CSS       | {english_stem} | english_stem | {css}
 blank     | Space symbols   |           | {}             |              | 
 asciiword | Word, all ASCII | C         | {english_stem} | english_stem | {c}
 blank     | Space symbols   | #         | {}             |              | 
 asciiword | Word, all ASCII | Objective | {english_stem} | english_stem | {object}
 blank     | Space symbols   |           | {}             |              | 
 asciiword | Word, all ASCII | C         | {english_stem} | english_stem | {c}
(10 rows)
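
To make the effect on matching concrete, here is a rough plain-Ruby simulation of two tokenization styles. It is only a sketch under stated assumptions: default_like_tokens, whitespace_tokens and match? are hypothetical helpers, the regex is a crude stand-in for the default parser (which also recognizes numbers, URLs, email addresses and more), and no stemming is applied.

```ruby
# Crude stand-in for the default parser (assumption): anything that is not
# an ASCII letter or digit acts as a separator, so "+" and "#" disappear.
def default_like_tokens(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Whitespace-only splitting keeps "+" and "#" inside the token.
def whitespace_tokens(text)
  text.downcase.split
end

# A document matches when every query token appears among its tokens.
def match?(tokenizer, document, query)
  doc_tokens = send(tokenizer, document)
  send(tokenizer, query).all? { |token| doc_tokens.include?(token) }
end

skills = ["CSS", "C++", "C#", "Objective C"]

p skills.select { |s| match?(:default_like_tokens, s, "C++") }
# => ["C++", "C#", "Objective C"]
p skills.select { |s| match?(:whitespace_tokens, s, "C++") }
# => ["C++"]
```

With the punctuation stripped, the query "C++" collapses to the single token "c", which is why it matches "C#" and "Objective C" as well. Splitting on whitespace only keeps "c++" as its own token.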

BTW, here is a very useful tutorial if you want to study PostgreSQL Full Text Search: http://shisaa.jp/postset/postgresql-full-text-search-part-1.html

UPDATE:

I found a solution within PostgreSQL Full Text Search. It involves using the test_parser extension, which is documented here: http://www.postgresql.org/docs/9.1/static/test-parser.html

First some configuration is required in psql:

postgres=# CREATE EXTENSION test_parser;

postgres=# CREATE TEXT SEARCH CONFIGURATION testcfg ( PARSER = testparser );

postgres=# ALTER TEXT SEARCH CONFIGURATION testcfg
    ADD MAPPING FOR word WITH english_stem;

Now you can index a test string and see that terms like "C++" and "C#" are kept as separate tokens, as desired:

postgres=# SELECT to_tsvector('testcfg', 'C++ CSS C# Objective C #GT40 GT40 added joined');
                                to_tsvector                                 
----------------------------------------------------------------------------
 '#gt40':6 'ad':8 'c':5 'c#':3 'c++':1 'css':2 'gt40':7 'join':9 'object':4
(1 row)

The question remains of how to integrate this into pg_search. I am looking at that next.
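
An untested possibility: pg_search appears to pass the tsearch :dictionary option through to to_tsvector and to_tsquery as the text search configuration name, so naming the custom configuration there may be enough. The sketch below rests on that assumption. The method name search_exact_tokens_options and the scope name search_exact_tokens are hypothetical, and the options are kept in a plain hash so the snippet runs without Rails.

```ruby
# Hypothetical scope options. Assumption: the "testcfg" configuration has
# been created in the database the Rails app connects to, as in the psql
# session above.
def search_exact_tokens_options
  {
    against: [:bio],
    using: { tsearch: { dictionary: "testcfg" } },
    associated_against: { user: [:fname, :lname], skills: :name }
  }
end

# In the model, with the pg_search gem loaded:
#   pg_search_scope :search_exact_tokens, search_exact_tokens_options
```

Prefix matching is left off here, as in the question's second scope, because it would let a shorter query match longer tokens again.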

garbo999 answered Sep 22 '22 11:09