Re-Writing "Fuzzy Join" Functions from R to SQL

Tags:

In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:

library(fuzzyjoin)
library(dplyr)

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

sample_query = sqlQuery( stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
  filter(date_1 >= date_2, date_1 <= date_3) )

view(sample_query)

However, I do not think this is possible, because the function which us being used for the "fuzzy join" (stringdist_inner_join) is not supported .

I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R

My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?

Right now I can only execute "sample_query" on locally - re-writing this function (stringdist_inner_join) would let perform the "sample_query" much faster.

Does anyone know if this is possible?

Note:

My data looks like this:

table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

938

asked Dec 05 '21 04:12

stats_noob

Video Answer

2 Answers

Based on your other post about this issue, a solution to the question of how to structure the SQL query was solved:

SAS: Fuzzy Joins

select a.*, b.* 
from table_a a
inner join table_b b
  on (a.date_1 between b.date_2 and b.date_3)
      and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)

To get this to run in an R script, I would recommend using dbplyr and creating this using tbl so you can continue doing basic manipulation of it as if it were a data.frame and dbplyr will translate it into SQL (at least basic commands), then combine everything into a query and eventually pull the data from the query with the collect() function.

Edit: Just a note, the tbl command will start building a SQL statement and get column names, but it won't run it to pull data until you enter collect() at which point, R will send the query to the server, the server will run the query and send the data.

Just keep this in mind because if dbplyr can't translate something to SQL, it will assume it's a SQL command and try to send it, so you won't know there's an error until you try to collect. For example, a function from the stringr package, str_dectect, isn't implemented in dbplyr and so dbplyr would send that command to the database, which would throw an error because it doesn't know what that is, but only after running collect(). Check out the dbplyr page linked above for details.


library(dbplyr)

new_con<- dbConnect(
  odbc(),
  Driver= "ODBC Driver 17 for SQL Server (as an example)",
  Server = "Server name here",
  uid = "some_id",
  pwd = "abc"
)


sample_query<- dbplyr::tbl(
  new_con,
  dbplyr::sql(
  "select a.*, b.* 
    from table_a a
    inner join table_b b
      on (a.date_1 between b.date_2 and b.date_3)"

sample_data<-sample_query %>%
 filter(silly_example==TRUE) %>%
 collect()

144

answered Oct 21 '22 15:10

Roger-123

I agree with @Roger-123's approach. But here is a variation that might assist:

Assuming you are using remote connections to access the Netezza database, you could do this using dbplyr as follows:

remote_1 = tbl(con, "table_1_name")
remote_2 = tbl(con, "table_2_name")

# create dummy column
remote_1 = mutate(remote_1, ones = 1)
remote_2 = mutate(remote_2, ones = 1)

output = remote_1 %>%
  # cross_join
  inner_join(remote_2, by = "ones", prefix = c("_1","_2")) %>%
  # calculate Levenshtein distance
  mutate(distance = le_dst(id1, id2)) %>%
  # filter to close matches
  filter(distance <= 2)

Notes:

dbplyr does not allow for complex conditions in its joins. Hence we do the most general join possible and then filter.
If you also want joins by date, then you can put them into the inner_join if the conditions are simple, or create another filter condition if they are complex.
le_dst is not an R function and there is no dbplyr translation for it, so dbplyr will pass it to the server as-is.
Netezza accepts two distance functions for text: le_dst and dle_dst. You can use whichever you please here.
Output is a query, it will act like a table but it is being generated/calculated on the fly. It has not been written to disk or loaded into R memory. Depending on your application you will want to store/save this.

answered Oct 21 '22 14:10

Simon.S.A.

Related questions
                            
                                Hibernate lazy loading not work with many-to-one mapping
                            
                                Microsoft SQL Server backup physical_device_name
                            
                                How to use a group by Sum SQL with Spring Data JPA?
                            
                                Invalid Identifier on Sql left join oracle
                            
                                Remove duplicates with less null values
                            
                                Oracle performance: query executing multiple identical function calls
                            
                                PostgreSQL find all possible combinations (permutations) in recursive query
                            
                                Not Exists vs Not In: efficiency
                            
                                How get the Id of inserted row in SQL [duplicate]
                            
                                AWS Redshift column limit?
                            
                                Insert BLOB test string bigger than 2000 or 4000 bytes
                            
                                SSRS not showing all rows by level (hierarchy)
                            
                                Why select top ... order by indexed column still sort?
                            
                                Prioritize JOIN on multiple columns with an OR
                            
                                How would I search for text that contains emojis?
                            
                                What are the origins of terms DDL, DML and DCL?
                            
                                In SQLAlchemy, why is my load_only not filtering any columns that I have specified?
                            
                                Is it possible that I could find a row contains a string? Assume that I do not know which columns contain a string
                            
                                Updating a value in SQL (better-sqlite3) with Node.JS
                            
                                Execute Subquery refactoring first before any other SQL

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Re-Writing "Fuzzy Join" Functions from R to SQL

Tags:

sql

join

r

netezza