Redshift: Any ways to compute fuzzy string similarity / string edit distance?

Tags:

In PSQL (which I believe Redshift is based), there are string similarity functions like levenshtein / levenshtein_less_equal [ http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html ]. These features don't seem to have made it into Redshift [ http://docs.aws.amazon.com/redshift/latest/dg/String_functions_header.html ]. Am I wrong, or has anyone made a creative query to work around this limitation?

711

asked Oct 27 '14 22:10

gatoatigrado

1 Answers

Redshift is based on Postgres 8.0, which is extremely old (and no longer supported), so they've diverged substantially over the years. The development version of Postgres is currently 9.4, and 8.x and 9.x versions of Postgres have some substantial differences and additions in 9.0 and up.

The levenshtein function is part of the fuzzystrmatch module that you linked above, and that module was introduced in Postgres 8.3, which is likely why it didn't make the cut for Redshift (and it apparently has not since been added).

Normally I would say your best bet would be to define a custom PL/pgSQL function to implement the Levenshtein Distance algorithm, but according to the Redshift doc, User-defined functions and stored procedures are one of the many major features of Postgres that Redshift does not support, so I think you're unfortunately out of luck on that front.

Depending on your requirements, you may be able to use LIKE to achieve similar results. See this SO answer for more info on that (note that some of the alternative suggestions in that answer, such as full text are also not supported in Redshift).

Update, 2016-04-25

It appears as though since I originally answered this question back in Oct. 2014, an ability to create Python-based User Defined Functions (UDFs) has been added. This is not as ideal as being able to create a custom Postgres function inline (the doc lists an assortment of caveats for the UDFs), but should allow the levenshtein distance algorithm to be implemented in Python and processed within the context of a Redshift query.

UDFs in Apache Hive, a data warehousing project used in the Hadoop ecosystem, allows for user-defined functions in a similar manner (Java or Python based).

100

answered Oct 21 '22 05:10

khampson

Related questions
                            
                                heroku postgres dump - not working locally
                            
                                Speed of many individual updates vs. one large update
                            
                                How to find whether unique key constraint exists for given columns
                            
                                Postgres pg_try_advisory_lock blocks all records
                            
                                pg_dump gets SSL error, seems to time out
                            
                                What is uuid ossp in postgres
                            
                                postgres text storage inline or in "background table"?
                            
                                PostgreSQL query gets magnitudes slower when sorting by columns from two different tables
                            
                                How to speed up table-retrieval with MATLAB and JDBC?
                            
                                Postgresql JSON column as HashWithIndifferentAccess
                            
                                What's the right way to calculate table size in postgres?
                            
                                How to find the character in nth position in a string
                            
                                What difference between NULL and NULL::character varying in PostgreSQL
                            
                                SELECT multiple rows and columns into a record variable
                            
                                Postgresql, add (years, months or days) to date based on another column
                            
                                Robust approach for building SQL queries programmatically
                            
                                Connecting to remote PostgreSQL database with SQLAlchemy using SSH tunneling with public key and passphrase, all that from a Windows machine
                            
                                Foreign key constraints involving multiple tables
                            
                                Is jOOQ's fetchLazy() truly lazy?
                            
                                How to concatenate field values with recursive query in postgresql?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Redshift: Any ways to compute fuzzy string similarity / string edit distance?

Tags:

postgresql

amazon-redshift

edit-distance

gatoatigrado

People also ask

1 Answers

khampson

Recent Activity

Donate For Us