How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):

lower(location_name) LIKE '%cafe%'

as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on

gin(to_tsvector('simple', location_name))

and searching with

(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))

which works beautifully, and cuts down the search time by a couple of orders of magnitude.
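For reference, a complete version of that index and query might look like this (a sketch, assuming the table is called locations; the index name is made up):

CREATE INDEX locations_name_fts_idx ON locations
  USING gin (to_tsvector('simple', location_name));

SELECT *
  FROM locations
 WHERE to_tsvector('simple', location_name) @@ to_tsquery('simple', 'cafe');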

However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
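For example (just an illustration; exact tokenization can vary with the database locale), a space-free Chinese name is indexed as a single lexeme, so a partial-name search matches nothing:

-- the whole name is one lexeme, so searching for part of it returns false
SELECT to_tsvector('simple', '北京烤鸭店') @@ to_tsquery('simple', '烤鸭');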

So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?

asked Oct 11 '12 by AlbeyAmakiir


1 Answer

If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:

CREATE INDEX table_location_name_trigrams_key ON table
  USING gin (location_name gin_trgm_ops);
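(Note that pg_trgm ships as a contrib extension, so it has to be enabled in the database first; on PostgreSQL 9.1 or later:)

CREATE EXTENSION pg_trgm;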

This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:

SELECT * FROM table WHERE location_name ILIKE '%cafe%';

This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
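If you want to confirm that the planner is actually using the trigram index, EXPLAIN the query (a sketch; the plan shape depends on table size and statistics, and a small table may still get a sequential scan):

EXPLAIN ANALYZE
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
-- look for a Bitmap Index Scan on table_location_name_trigrams_key
-- rather than a Seq Scan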

The catch is that trigrams don't work at all for patterns shorter than three characters. That may or may not be a deal-breaker for your application.


Edit: I initially added this as a comment.

I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, runs regexp_matches against the CJK Unicode ranges, and returns an array of any such characters, or NULL if there are none. Add a GIN index on cjk_chars(location_name). Then query for:

WHERE CASE
  WHEN cjk_chars('query') IS NOT NULL THEN
    cjk_chars(location_name) @> cjk_chars('query')
    AND location_name LIKE '%query%'
  ELSE
    <tsvector/trigrams>
  END

Ta-da, unigrams!
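Here is a rough sketch of what such a function and index could look like. It is illustrative only: the regex covers just the basic CJK Unified Ideographs blocks, and the function and index names (and the table placeholder) are made up:

-- Extract the CJK characters of a string as a text array, or NULL if there are none.
-- IMMUTABLE is required so the function can be used in an index expression.
CREATE OR REPLACE FUNCTION cjk_chars(text) RETURNS text[]
  LANGUAGE sql IMMUTABLE AS $$
    SELECT NULLIF(
      ARRAY(SELECT (regexp_matches($1, '[\u3400-\u4dbf\u4e00-\u9fff]', 'g'))[1]),
      ARRAY[]::text[]
    );
  $$;

CREATE INDEX table_location_name_cjk_key ON table
  USING gin (cjk_chars(location_name));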

answered Sep 21 '22 by willglynn