
Re-Writing "Fuzzy Join" Functions from R to SQL

Tags: sql, join, r, netezza

In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:

library(fuzzyjoin)
library(dplyr)

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

sample_query = sqlQuery( stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
  filter(date_1 >= date_2, date_1 <= date_3) )

view(sample_query)

However, I do not think this is possible, because the function being used for the "fuzzy join" (stringdist_inner_join) is not supported by the database.

I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R

My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?

Right now I can only execute "sample_query" locally. Re-writing this function (stringdist_inner_join) would let me perform the "sample_query" much faster.

Does anyone know if this is possible?

Note:

My data looks like this:

table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
stats_noob asked Dec 05 '21 04:12


2 Answers

Based on your other post about this issue, the question of how to structure the SQL query has already been solved:

SAS: Fuzzy Joins

select a.*, b.* 
from table_a a
inner join table_b b
  on (a.date_1 between b.date_2 and b.date_3)
      and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)

To run this from an R script, I would recommend dbplyr. Create a lazy table with tbl() so you can keep doing basic manipulation as if it were a data.frame. dbplyr translates that manipulation into SQL (at least the basic commands) and combines everything into one query, and you pull the data from the query with the collect() function.

Edit: Just a note: tbl() starts building a SQL statement and fetches the column names, but it does not pull any data until you call collect(). At that point R sends the query to the server, the server runs the query, and the data comes back.

Keep this in mind, because if dbplyr can't translate a function to SQL, it assumes it is a SQL function and passes it through unchanged, so you won't know there's an error until you try to collect. For example, if a function from the stringr package such as str_detect has no translation for your backend, dbplyr sends that call to the database as-is. The database then throws an error because it doesn't know the function, but only after you run collect(). Check out the dbplyr page linked above for details.
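To see what dbplyr would send before anything reaches the server, one option is a simulated connection. The sketch below is only an illustration: lazy_frame(), simulate_dbi(), show_query() and sql_render() come from dbplyr/dplyr, while fake_remote and its values are made up here, and the generic simulated backend is an assumption (the SQL rendered for Netezza may differ).

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds a stand-in "remote" table with no database behind it
# (fake_remote and its values are hypothetical, for illustration only)
fake_remote <- lazy_frame(id1 = "123 A", id2 = "11", con = simulate_dbi())

query <- fake_remote %>%
  # le_dst is unknown to R and to dbplyr, so it is passed through untranslated
  mutate(distance = le_dst(id1, id2)) %>%
  filter(distance <= 2)

# print the generated SQL; nothing is executed
show_query(query)

# the same SQL as a plain string, e.g. for logging or inspection
sql_text <- as.character(sql_render(query))
```

Because no query is executed, a call the database would reject still renders without complaint here; the error would only surface at collect() against the real server.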


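library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

# connection details are placeholders; use the driver for your own database
new_con <- dbConnect(
  odbc(),
  Driver = "ODBC Driver 17 for SQL Server",  # only an example driver name
  Server = "Server name here",
  uid = "some_id",
  pwd = "abc"
)

# wrap the hand-written SQL in a lazy table; nothing runs on the server yet
sample_query <- tbl(
  new_con,
  sql(
    "select a.*, b.*
     from table_a a
     inner join table_b b
       on (a.date_1 between b.date_2 and b.date_3)"
  )
)

# further dplyr verbs are added to the query; collect() runs it and returns the data
sample_data <- sample_query %>%
  filter(silly_example == TRUE) %>%
  collect()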
Roger-123 answered Oct 21 '22 15:10


I agree with @Roger-123's approach. But here is a variation that might assist:

Assuming you are using remote connections to access the Netezza database, you could do this using dbplyr as follows:

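library(dplyr)
library(dbplyr)

# con must be a DBI connection (e.g. from DBI::dbConnect), not an RODBC one
remote_1 = tbl(con, "table_1_name")
remote_2 = tbl(con, "table_2_name")

# create dummy column
remote_1 = mutate(remote_1, ones = 1)
remote_2 = mutate(remote_2, ones = 1)

output = remote_1 %>%
  # cross join: every row of remote_1 is paired with every row of remote_2
  inner_join(remote_2, by = "ones", suffix = c("_1","_2")) %>%
  # calculate Levenshtein distance; columns present in both tables carry the suffixes
  mutate(distance = le_dst(id2_1, id2_2)) %>%
  # filter to close matches
  filter(distance <= 2)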

Notes:

  • dbplyr does not allow for complex conditions in its joins. Hence we do the most general join possible and then filter.
  • If you also want joins by date, then you can put them into the inner_join if the conditions are simple, or create another filter condition if they are complex.
  • le_dst is not an R function and there is no dbplyr translation for it, so dbplyr will pass it to the server as-is.
  • Netezza accepts two distance functions for text: le_dst and dle_dst. You can use whichever you please here.
  • output is a query: it acts like a table, but it is generated/calculated on the fly. It has not been written to disk or loaded into R memory. Depending on your application you will want to store/save this.
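To check what the cross-join-then-filter approach should return, the same logic can be run locally on the sample data from the question. This is a minimal sketch in base R, not the fuzzyjoin or Netezza implementation: it assumes plain Levenshtein distance via adist(), whereas stringdist_inner_join defaults to the "osa" method, so results can differ when characters are transposed. The names pairs and matches are made up for this example.

```r
# Sample data from the question, with dates parsed as Date rather than factor
table_1 <- data.frame(
  id2 = c("11", "12", "14", "13"),
  date_1 = as.Date(c("2010-01-31", "2010-01-31", "2015-01-31", "2018-01-31")),
  stringsAsFactors = FALSE
)
table_2 <- data.frame(
  id2 = c("111", "112", "14", "113"),
  date_2 = as.Date(c("2009-01-31", "2010-01-31", "2010-01-31", "2010-01-31")),
  date_3 = as.Date(c("2011-01-31", "2010-01-31", "2020-01-31", "2020-01-31")),
  stringsAsFactors = FALSE
)

# Cross join: merge() with by = NULL returns every pairing of rows,
# and the shared column id2 becomes id2_1 and id2_2
pairs <- merge(table_1, table_2, by = NULL, suffixes = c("_1", "_2"))

# Levenshtein distance for each pair (adist() is in the utils package)
pairs$distance <- mapply(
  function(a, b) drop(adist(a, b)),
  pairs$id2_1, pairs$id2_2,
  USE.NAMES = FALSE
)

# Keep close matches that also satisfy the date window
matches <- pairs[
  pairs$distance <= 2 &
    pairs$date_1 >= pairs$date_2 &
    pairs$date_1 <= pairs$date_3,
]

print(matches)
```

Comparing matches with the rows returned by the remote query is a quick sanity check on a small extract; it does not scale, since the cross join grows with the product of the two table sizes.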
Simon.S.A. answered Oct 21 '22 14:10