
How would you do this task using SQL or R library sqldf?

I need to implement the following function (ideally in R or SQL). Given two data frames, each with a userid column and a set of boolean attribute columns (only 0's and 1's are permitted), I need to return a new data frame with two columns, userid and count, where count is the number of attribute columns whose values match for that user across the two tables. A user could occur in both data frames or in just one. In the latter case, I need to return NA for that user's count. Here is an example:

DF1
ID c1 c2 c3 c4 c5
1   0  1  0  1  1
10  1  0  1  0  0
5   0  1  1  1  0
20  1  1  0  0  1
3   1  1  0  0  1
6   0  0  1  1  1
71  1  0  1  0  0
15  0  1  1  1  0
80  0  0  0  1  0

DF2  
ID c1 c2 c3 c4 c5
5   1  0  1  1  0
6   0  1  0  0  1
15  1  0  0  1  1
80  1  1  1  0  0
78  1  1  1  0  0
98  0  0  1  1  1
1   0  1  0  0  1
2   1  0  0  1  1
9   0  0  0  1  0

My function must return something like this (the following is a subset):

DF_Return
ID Count
1    4
2    NA
80   1
20   NA
   .
   .
   .

Could you give me any suggestions for carrying this out? I'm not much of an expert in SQL.

Here is the R code that generates the example data used above.

 id1=c(1,10,5,20,3,6,71,15,80)
 c1=c(0,1,0,1,1,0,1,0,0)
 c2=c(1,0,1,1,1,0,0,1,0)
 c3=c(0,1,1,0,0,1,1,1,0)
 c4=c(1,0,1,0,0,1,0,1,1)
 c5=c(1,0,0,1,1,1,0,0,0)
 DF1=data.frame(ID=id1,c1=c1,c2=c2,c3=c3,c4=c4,c5=c5)
 DF2=data.frame(ID=c(5,6,15,80,78,98,1,2,9),c1=c2,c2=c1,c3=c5,c4=c4,c5=c3)

Many thanks in advance. Best Regards!

nhern121 asked Feb 20 '23 20:02


1 Answer

Here are two approaches. The first hardcodes the columns to compare, while the second is more general and agnostic to how many columns DF1 and DF2 have:

#Merge together using all = TRUE for the equivalent of a full outer join
DF3 <- merge(DF1, DF2, by = "ID", all = TRUE, suffixes= c(".1", ".2"))
#Count, per row, the columns whose values match; IDs missing from one side give NA
out1 <- data.frame(ID = DF3[, 1], count = rowSums(DF3[, 2:6] ==  DF3[, 7:ncol(DF3)]))

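#Approach that is agnostic to the number of columns you have
library(reshape2)
library(plyr)
#Melt to long format: one row per ID and suffixed column (e.g. c1.1, c1.2)
DF3.m <- melt(DF3, id.vars = 1)
#Split names such as "c1.2" into the attribute ("level") and the source data frame ("DF")
DF3.m[, c("level", "DF")] <- with(DF3.m, colsplit(variable, "\\.", c("level", "DF")))
#Cast back so each ID/attribute pair has one column per source data frame
out2 <- dcast(data = DF3.m, ID + level ~ DF, value.var="value")
colnames(out2)[3:4] <- c("DF1", "DF2")
out2 <- ddply(out2, "ID", summarize, count = sum(DF1 == DF2))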

#Are they the same?
all.equal(out1, out2)
#[1] TRUE
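
If you would rather avoid the reshape2 and plyr dependencies, the same column-agnostic idea can probably be done in base R alone. The following is only a sketch under a few assumptions: `count_matches` is a hypothetical helper name, the key column is called `ID`, and the attribute columns to compare are the ones whose names appear in both data frames. The data is copied from the question so the block runs on its own.

```r
# Example data, copied from the question
c1 <- c(0,1,0,1,1,0,1,0,0)
c2 <- c(1,0,1,1,1,0,0,1,0)
c3 <- c(0,1,1,0,0,1,1,1,0)
c4 <- c(1,0,1,0,0,1,0,1,1)
c5 <- c(1,0,0,1,1,1,0,0,0)
DF1 <- data.frame(ID = c(1,10,5,20,3,6,71,15,80),
                  c1 = c1, c2 = c2, c3 = c3, c4 = c4, c5 = c5)
DF2 <- data.frame(ID = c(5,6,15,80,78,98,1,2,9),
                  c1 = c2, c2 = c1, c3 = c5, c4 = c4, c5 = c3)

# Hypothetical helper (not part of any package)
count_matches <- function(a, b, id = "ID") {
  # Attribute columns present in both data frames, excluding the key
  attrs <- setdiff(intersect(names(a), names(b)), id)
  # Full outer join; IDs missing from one side get NA in that side's columns
  m <- merge(a[c(id, attrs)], b[c(id, attrs)], by = id, all = TRUE,
             suffixes = c(".1", ".2"))
  left  <- as.matrix(m[paste0(attrs, ".1")])
  right <- as.matrix(m[paste0(attrs, ".2")])
  # rowSums propagates NA, so IDs found in only one data frame get count NA
  data.frame(ID = m[[id]], count = rowSums(left == right))
}

out3 <- count_matches(DF1, DF2)
head(out3)
```

Because the columns are looked up by name rather than by position, this should keep working if the attribute columns appear in a different order in the two data frames.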

> head(out1)
  ID count
1  1     4
2  2    NA
3  3    NA
4  5     3
5  6     2
6  9    NA
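
Since the question also asks about sqldf, here is a rough sketch of the same computation in SQL. It assumes the sqldf package is installed and uses its default SQLite backend. Older SQLite versions lack FULL OUTER JOIN, so the sketch builds the list of all IDs with UNION and LEFT JOINs both tables onto it. The names `ids`, `a`, `b` and `out_sql` are illustrative only, and the five columns are hardcoded as in the first approach above.

```r
library(sqldf)  # assumption: the sqldf package is installed

# Example data, written out from the tables in the question
DF1 <- data.frame(ID = c(1,10,5,20,3,6,71,15,80),
                  c1 = c(0,1,0,1,1,0,1,0,0),
                  c2 = c(1,0,1,1,1,0,0,1,0),
                  c3 = c(0,1,1,0,0,1,1,1,0),
                  c4 = c(1,0,1,0,0,1,0,1,1),
                  c5 = c(1,0,0,1,1,1,0,0,0))
DF2 <- data.frame(ID = c(5,6,15,80,78,98,1,2,9),
                  c1 = c(1,0,1,1,1,0,0,1,0),
                  c2 = c(0,1,0,1,1,0,1,0,0),
                  c3 = c(1,0,0,1,1,1,0,0,0),
                  c4 = c(1,0,1,0,0,1,0,1,1),
                  c5 = c(0,1,1,0,0,1,1,1,0))

# In SQLite each comparison yields 1, 0 or NULL, and NULL propagates
# through "+", so an ID missing from either table gets a NULL (NA) count
out_sql <- sqldf('
  SELECT ids.ID AS ID,
         (a.c1 = b.c1) + (a.c2 = b.c2) + (a.c3 = b.c3) +
         (a.c4 = b.c4) + (a.c5 = b.c5) AS "count"
  FROM (SELECT ID FROM DF1 UNION SELECT ID FROM DF2) AS ids
  LEFT JOIN DF1 AS a ON a.ID = ids.ID
  LEFT JOIN DF2 AS b ON b.ID = ids.ID
  ORDER BY ids.ID
')
head(out_sql)
```

The SQL version has to name every attribute column explicitly, so for wide tables the merge-based R approaches are likely easier to maintain.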
Chase answered Feb 23 '23 15:02