Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Redshift: Any ways to compute fuzzy string similarity / string edit distance?

In PSQL (which I believe Redshift is based), there are string similarity functions like levenshtein / levenshtein_less_equal [ http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html ]. These features don't seem to have made it into Redshift [ http://docs.aws.amazon.com/redshift/latest/dg/String_functions_header.html ]. Am I wrong, or has anyone made a creative query to work around this limitation?

like image 711
gatoatigrado Avatar asked Oct 27 '14 22:10

gatoatigrado


People also ask

What is fuzzystrmatch?

The fuzzystrmatch module provides two functions for working with Soundex codes: soundex(text) returns text difference(text, text) returns int. The soundex function converts a string to its Soundex code.

What is Ilike in redshift?

ILIKE performs a case-insensitive pattern match for single-byte UTF-8 (ASCII) characters. To perform a case-insensitive pattern match for multibyte characters, use the LOWER function on expression and pattern with a LIKE condition.

What is levenshtein fuzzy matching?

The Levenshtein Distance (LD) is one of the fuzzy matching techniques that measure between two strings, with the given number representing how far the two strings are from being an exact match. The higher the number of the Levenshtein edit distance, the further the two terms are from being identical.


1 Answers

Redshift is based on Postgres 8.0, which is extremely old (and no longer supported), so they've diverged substantially over the years. The development version of Postgres is currently 9.4, and 8.x and 9.x versions of Postgres have some substantial differences and additions in 9.0 and up.

The levenshtein function is part of the fuzzystrmatch module that you linked above, and that module was introduced in Postgres 8.3, which is likely why it didn't make the cut for Redshift (and it apparently has not since been added).

Normally I would say your best bet would be to define a custom PL/pgSQL function to implement the Levenshtein Distance algorithm, but according to the Redshift doc, User-defined functions and stored procedures are one of the many major features of Postgres that Redshift does not support, so I think you're unfortunately out of luck on that front.

Depending on your requirements, you may be able to use LIKE to achieve similar results. See this SO answer for more info on that (note that some of the alternative suggestions in that answer, such as full text are also not supported in Redshift).

Update, 2016-04-25

It appears as though since I originally answered this question back in Oct. 2014, an ability to create Python-based User Defined Functions (UDFs) has been added. This is not as ideal as being able to create a custom Postgres function inline (the doc lists an assortment of caveats for the UDFs), but should allow the levenshtein distance algorithm to be implemented in Python and processed within the context of a Redshift query.

UDFs in Apache Hive, a data warehousing project used in the Hadoop ecosystem, allows for user-defined functions in a similar manner (Java or Python based).

like image 100
khampson Avatar answered Oct 21 '22 05:10

khampson