In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:
library(fuzzyjoin)
library(dplyr)
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
sample_query = sqlQuery( stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
filter(date_1 >= date_2, date_1 <= date_3) )
view(sample_query)
However, I do not think this is possible, because the function which us being used for the "fuzzy join" (stringdist_inner_join) is not supported .
I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R
My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?
Right now I can only execute "sample_query" on locally - re-writing this function (stringdist_inner_join) would let perform the "sample_query" much faster.
Does anyone know if this is possible?
Note:
My data looks like this:
table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id1 = c("0123", "1233", "125 .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
Fuzzyjoin R package can be used in SQL Server 2017 and higher to perform different kinds of fuzzy joins of data. First, let's understand what distinct types of fuzzy joins are supported by this package. Below is a list of distinct types of inexact matching supported by the fuzzyjoin package along with the associated function name.
In a real life scenario, data would come from a table in SQL Server for both datasets that would be joined for the fuzzy matching by using the InputDataSet data frame to pass the data from SQL Server to the R Script. And after the fuzzy join is applied, the output can be extracted from the R script using the OutputDataSet data frame.
The fuzzyjoin package is a variation on dplyr's join operations that allows matching not just on values that match between columns, but on inexact matching. This allows matching on: One relevant use case is for classifying freeform text data (such as survey responses) against a finite set of options. The package also includes:
Other Regex based join functions include regex_left_join (include all rows of left table), regex_right_join (include all rows of right table), regex_full_join (include all rows in each table), regex_semi_join (filter left table for rows with matches), regex_anti_join (filter left table for rows without matches)
Based on your other post about this issue, a solution to the question of how to structure the SQL query was solved:
SAS: Fuzzy Joins
select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3)
and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)
To get this to run in an R script, I would recommend using dbplyr and creating this using tbl
so you can continue doing basic manipulation of it as if it were a data.frame and dbplyr will translate it into SQL (at least basic commands), then combine everything into a query and eventually pull the data from the query with the collect()
function.
Edit: Just a note, the tbl
command will start building a SQL statement and get column names, but it won't run it to pull data until you enter collect()
at which point, R will send the query to the server, the server will run the query and send the data.
Just keep this in mind because if dbplyr can't translate something to SQL, it will assume it's a SQL command and try to send it, so you won't know there's an error until you try to collect. For example, a function from the stringr
package, str_dectect
, isn't implemented in dbplyr
and so dbplyr
would send that command to the database, which would throw an error because it doesn't know what that is, but only after running collect()
. Check out the dbplyr page linked above for details.
library(dbplyr)
new_con<- dbConnect(
odbc(),
Driver= "ODBC Driver 17 for SQL Server (as an example)",
Server = "Server name here",
uid = "some_id",
pwd = "abc"
)
sample_query<- dbplyr::tbl(
new_con,
dbplyr::sql(
"select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3)"
sample_data<-sample_query %>%
filter(silly_example==TRUE) %>%
collect()
I agree with @Roger-123's approach. But here is a variation that might assist:
Assuming you are using remote connections to access the Netezza database, you could do this using dbplyr as follows:
remote_1 = tbl(con, "table_1_name")
remote_2 = tbl(con, "table_2_name")
# create dummy column
remote_1 = mutate(remote_1, ones = 1)
remote_2 = mutate(remote_2, ones = 1)
output = remote_1 %>%
# cross_join
inner_join(remote_2, by = "ones", prefix = c("_1","_2")) %>%
# calculate Levenshtein distance
mutate(distance = le_dst(id1, id2)) %>%
# filter to close matches
filter(distance <= 2)
Notes:
inner_join
if the conditions are simple, or create another filter condition if they are complex.le_dst
is not an R function and there is no dbplyr translation for it, so dbplyr will pass it to the server as-is.le_dst
and dle_dst
. You can use whichever you please here.Output
is a query, it will act like a table but it is being generated/calculated on the fly. It has not been written to disk or loaded into R memory. Depending on your application you will want to store/save this.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With