Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create simple fuzzy search with PostgreSQL only?

I have a little problem with search functionality on my RoR based site. I have many Produts with some CODEs. This code can be any string like "AB-123-lHdfj". Now I use ILIKE operator to find products:

Product.where("code ILIKE ?", "%" + params[:search] + "%") 

It works fine, but it can't find product with codes like "AB123-lHdfj", or "AB123lHdfj".

What should I do for this? May be Postgres has some string normalization function, or some other methods to help me?

like image 474
Alve Avatar asked Oct 11 '11 17:10

Alve


People also ask

How do you do a fuzzy search?

Many search engines enable users to specifically request a fuzzy search in the search query by using a tilde (~) at the end of the word or term they want to search with fuzziness.

Can you do a fuzzy match in SQL?

You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches.

Is Elasticsearch faster than Postgres?

And the more size you want to search in, the more Elasticsearch is better than PostgreSQL in performance. Additionally, you could also get many benefits and great performance if you pre-process the posts into several fields and indexes well before storing into Elasticsearch.

What is a fuzzy search database?

Fuzzy searches find words or phrases (values) that are similar, but not necessarily identical, to other words or phrases (values). This type of search has many uses, such as finding sequence errors, spelling errors, transposed characters, and others we'll cover later.


Video Answer


2 Answers

Postgres provides a module with several string comparsion functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.

Example:  test=# SELECT levenshtein('GUMBO', 'GAMBOL');  levenshtein -------------            2 (1 row) 

The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.

Try this query sample: (with your own object names and data of course)

SELECT *  FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10 

This query says:

Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is less than 3. You will get back all rows where the value of code is within 3 characters difference to 'AB123-lHdfj'...

Note: if you get an error like:

function levenshtein(character varying, unknown) does not exist 

Install the fuzzystrmatch extension using:

test=# CREATE EXTENSION fuzzystrmatch; 
like image 85
Paul Sasik Avatar answered Oct 05 '22 06:10

Paul Sasik


Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.

If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:

WHERE code ILIKE '%AB-123-lHdfj%'

Use:

WHERE code ILIKE '%AB%123%lHdfj%' 

Or, dynamically:

WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%' 

% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:

WHERE code ~* 'AB.?123.?lHdfj' 

.? ... 0 or 1 characters

Or:

WHERE code ~* 'AB\-?123\-?lHdfj' 

\-? ... 0 or 1 dashes

You may want to escape special characters in LIKE or regexp patterns. See:

  • Escape function for regular expression or LIKE patterns

If your actual problem is more complex and you need something faster then there are various options, depending on your requirements:

  • There is full text search, of course. But this may be an overkill in your case.

  • A more likely candidate is trigram-matching with the additional module pg_trgm. See:

    • Using Levenshtein function on each element in a tsvector?
    • PostgreSQL LIKE query performance variations
    • Related blog post by Depesz

    Can be combined it with LIKE, ILIKE, ~, or ~* since PostgreSQL 9.1.
    Also interesting in this context: the similarity() function or % operator of that module.

  • Last but not least you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj --> ab123lhdfj, save it in an additional column and search with terms transformed the same way.

    Or use an index on the expression instead of the redundant column. (Involved functions must be IMMUTABLE.) Possibly combine that with pg_tgrm from above.

Overview of pattern-matching techniques:

  • Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
like image 34
Erwin Brandstetter Avatar answered Oct 05 '22 06:10

Erwin Brandstetter