Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to query number of Ids in batches in R

I have below mentioned dataframe in R.

ID       Amount     Date
IK-1     100        2020-01-01
IK-2     110        2020-01-02
IK-3     120        2020-01-03
IK-4     109        2020-01-03
IK-5     104        2020-01-03

I'm using ID to fetch some details from MySQL using the following code.

library(RMySQL)

conn<- connection

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where c.ID IN (", paste(shQuote(dataframe$ID, type = "sh"),
                                      collapse = ', '),") 
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"

The above code is working fine, If i pass ~1000 ID but it throws R termination error when passing 10000 or more ID at a time.

How can I create a loop and pass the Id in batched to get the one final output for 10000 ID.

Error Message:

Warning message:
In dbFetch(rs, n = n, ...) : error while fetching rows
like image 522
Vector JX Avatar asked Feb 03 '20 08:02

Vector JX


2 Answers

Pass the ID's data frame into a temp table before your SQL query and then use that to inner join on the ID's that you are using, this way you can avoid looping. All you gotta do is use dbWriteTable and set the parameter temporary = TRUE when calling it.

EX:

library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), user='user', 
password='password', dbname='database_name', host='host')
#here we write the table into the DB and then declare it as temporary
dbWriteTable(conn = con, value = dataframe, name = "id_frame", temporary = T)
res1 <- dbGetQuery(con = conn, "SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
Inner join id_frame idf on idf.ID = c.ID 
and e.Parameters in
       ('Section1',
       'Section2','Section3',
       'Section4');")

This should improve performance on your code as well as you won't need to loop in R anymore with the where statement. Let me know if it isn't working properly.

like image 88
masterbingo1 Avatar answered Oct 14 '22 05:10

masterbingo1


# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)

# create the initial dataframe
df <- tribble(
    ~ID,       ~Amount,     ~Date
    , "IK-1"    , 100       , 2020-01-01
    , "IK-2"    , 110       , 2020-01-02
    , "IK-3"    , 120       , 2020-01-03
    , "IK-4"    , 109       , 2020-01-03
    , "IK-5"    , 104       , 2020-01-03
)

# first helper function
createIDBatchVector <- function(x, batchSize){
    paste0(
        "'"
        , sapply(
            split(x, ceiling(seq_along(x) / batchSize))
            , paste
            , collapse = "','"
        )
        , "'"
    )
}

# second helper function
createQueries <- function(IDbatches){
    paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
    LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
    LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,") 
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}

# ------------------------------------------------------------------

# and now the actual script

# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)

# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'" 

# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)

cat(queries) # use cat to show what they look like

# it looks like this:

# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');

# and now the loop
df_final <- data.frame() # initialize a dataframe

conn <- connection # open a connection
for (query in queries){ # iterate over the queries
    df_final <- rbind(df_final, dbGetQuery(conn,query))
}

# And here the connection should be closed. (I don't know the function call for this.)
like image 36
Georgery Avatar answered Oct 14 '22 05:10

Georgery