R merge two datasets based on specific columns with added condition

Tags: merge, join, dataframe, r

Both Uwe's and GKi's answers are correct. GKi received the bounty because Uwe's answer came too late for it, but Uwe's solution runs about 15x as fast.

I have two datasets that contain scores for different patients on multiple measuring moments like so:

df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  "Days" = c(0,25,235,353,100,538),
                  "Score" = c(NA,2,3,4,5,6), 
                  stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  "Days" = c(0,25,248,353,100,150,503),
                  "Score" = c(1,10,3,4,5,7,6), 
                  stringsAsFactors = FALSE)
> df1
        ID Days Score
1 patient1    0    NA
2 patient1   25     2
3 patient1  235     3
4 patient1  353     4
5 patient2  100     5
6 patient3  538     6

> df2
        ID Days Score
1 patient1    0     1
2 patient1   25    10
3 patient1  248     3
4 patient1  353     4
5 patient2  100     5
6 patient2  150     7
7 patient3  503     6

Column ID shows the patient ID, column Days shows the moment of measurement (days since patient inclusion), and column Score shows the measured score. Both datasets contain the same data but at different moments in time (df1 is from 2 years ago; df2 has the same data with updates from this year).

I have to compare the scores for each patient and each moment between both datasets. However, in some cases the Days variable has changed slightly over time, so comparing the datasets with a simple join does not work. Example:

library(dplyr)

> full_join(df1, df2, by=c("ID","Days")) %>% 
+   arrange(.[[1]], as.numeric(.[[2]]))

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   25       2      10
3 patient1  235       3      NA
4 patient1  248      NA       3
5 patient1  353       4       4
6 patient2  100       5       5
7 patient2  150      NA       7
8 patient3  503      NA       6
9 patient3  538       6      NA

Here, rows 3 and 4 contain data for the same measurement (with score 3) but are not joined because the values for the Days column are different (235 vs 248).

Question: I'm looking for a way to set a threshold on the second column (say 30 days) which would result in the following output:

> threshold <- 30
> *** insert join code ***

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   25       2      10
3 patient1  248       3       3
4 patient1  353       4       4
5 patient2  100       5       5
6 patient2  150      NA       7
7 patient3  503      NA       6
8 patient3  538       6      NA

This output shows that rows 3 and 4 of the previous output have been merged (because 248-235 < 30) and have taken the value for Days of the second df (248).
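
To make the threshold idea concrete, here is a minimal base R sketch that lists every pair of rows a 30-day threshold would make eligible for merging. The names `all_pairs` and `cand` are made up for this illustration, and the strict `<` comparison is an assumption taken from the "248-235 < 30" remark above. It only finds candidates; it does not yet decide which pairs to keep.

```r
df1 <- data.frame(ID = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  Days = c(0, 25, 235, 353, 100, 538),
                  Score = c(NA, 2, 3, 4, 5, 6),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  Days = c(0, 25, 248, 353, 100, 150, 503),
                  Score = c(1, 10, 3, 4, 5, 7, 6),
                  stringsAsFactors = FALSE)
threshold <- 30

# Pair every df1 row with every df2 row of the same patient ...
all_pairs <- merge(df1, df2, by = "ID", suffixes = c(".x", ".y"))
# ... and keep only the pairs whose Days differ by less than the threshold
# (strict "<" is an assumption of this sketch).
cand <- all_pairs[abs(all_pairs$Days.x - all_pairs$Days.y) < threshold, ]
cand
```

Besides the wanted pair (235, 248), the candidates also include cross pairs such as (0, 25) and (25, 0), which is why the extra conditions below are needed.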

Three main conditions to keep in mind are:

- Consecutive days that are within the threshold from within the same df (rows 1 and 2) are not merged.
- In some cases, up to four values for the Days variable exist in the same dataframe and thus should not be merged. It might be the case that one of these values does exist within the threshold in the other dataframe, and these will have to be merged. See row 3 in the example below.
- Each score/days/patient combination can only be used once. If a merge satisfies all conditions but there is still a double-merge possible, the first one should be used.

> df1
        ID Days Score
1 patient1    0     1
2 patient1    5     2
3 patient1   10     3
4 patient1   15     4
5 patient1   50     5

> df2
        ID Days Score
1 patient1    0     1
2 patient1    5     2
3 patient1   12     3
4 patient1   15     4
5 patient1   50     5

> df_combined
        ID Days Score.x Score.y
1 patient1    0       1       1
2 patient1    5       2       2
3 patient1   12       3       3
4 patient1   15       4       4
5 patient1   50       5       5
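
One possible reading of these conditions, shown only as an illustration and not as the method of any answer, is a greedy one-to-one matching per patient: take the closest candidate pairs first and never reuse a row. The function name `greedy_match`, the strict `<` comparison, and the tie-breaking by row order are all assumptions of this sketch.

```r
# Hypothetical helper: for one patient, return for each x row the index of the
# matched y row (NA if none). Rows are only ever matched across the two tables.
greedy_match <- function(x_days, y_days, threshold = 30) {
  # All (i, j) pairs across the two tables with their absolute day difference.
  cand <- expand.grid(i = seq_along(x_days), j = seq_along(y_days))
  cand$d <- abs(x_days[cand$i] - y_days[cand$j])
  cand <- cand[cand$d < threshold, ]
  # Closest pairs first; ties are broken by row order ("the first one is used").
  cand <- cand[order(cand$d, cand$i, cand$j), ]

  match_j <- rep(NA_integer_, length(x_days))
  used_j <- rep(FALSE, length(y_days))
  for (k in seq_len(nrow(cand))) {
    i <- cand$i[k]
    j <- cand$j[k]
    # Each row on either side may be used at most once.
    if (is.na(match_j[i]) && !used_j[j]) {
      match_j[i] <- j
      used_j[j] <- TRUE
    }
  }
  match_j
}

m <- greedy_match(c(0, 5, 10, 15, 50), c(0, 5, 12, 15, 50))
m
# 1 2 3 4 5: day 10 is paired with day 12, and the exact matches are kept
```

A greedy pass like this is simple but is not guaranteed to minimize the total day difference across all pairs.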

EDIT FOR CHINSOON12

> df1
          ID Days Score
 1: patient1    0     1
 2: patient1  116     2
 3: patient1  225     3
 4: patient1  309     4
 5: patient1  351     5
 6: patient2    0     6
 7: patient2   49     7
> df2
          ID Days Score
 1: patient1    0    11
 2: patient1   86    12
 3: patient1  195    13
 4: patient1  279    14
 5: patient1  315    15
 6: patient2    0    16
 7: patient2   91    17
 8: patient2  117    18

I wrapped your solution in a function like so:

testSO2 <- function(DT1,DT2) {
    setDT(DT1);setDT(DT2)
    names(DT1) <- c("ID","Days","X")
    names(DT2) <- c("ID","Days","Y")
    DT1$Days <- as.numeric(DT1$Days)
    DT2$Days <- as.numeric(DT2$Days)
    DT1[, c("s1", "e1", "s2", "e2") := .(Days - 30L, Days + 30L, Days, Days)]
    DT2[, c("s1", "e1", "s2", "e2") := .(Days, Days, Days - 30L, Days + 30L)]
    byk <- c("ID", "s1", "e1")
    setkeyv(DT1, byk)
    setkeyv(DT2, byk)
    o1 <- foverlaps(DT1, DT2)

    byk <- c("ID", "s2", "e2")
    setkeyv(DT1, byk)
    setkeyv(DT2, byk)
    o2 <- foverlaps(DT2, DT1)

    olaps <- funion(o1, setcolorder(o2, names(o1)))[
        is.na(Days), Days := i.Days]

    outcome <- olaps[, {
        if (all(!is.na(Days)) && any(Days == i.Days)) {
            s <- .SD[Days == i.Days, .(Days = Days[1L],
                                       X = X[1L],
                                       Y = Y[1L])]
        } else {
            s <- .SD[, .(Days = max(Days, i.Days), X, Y)]
        }
        unique(s)
    },
    keyby = .(ID, md = pmax(Days, i.Days))][, md := NULL][]
    return(outcome)
}

Which results in:

> testSO2(df1,df2)
          ID Days  X  Y
 1: patient1    0  1 11
 2: patient1  116  2 12
 3: patient1  225  3 13
 4: patient1  309  4 14
 5: patient1  315  4 15
 6: patient1  351  5 NA
 7: patient2    0  6 16
 8: patient2   49  7 NA
 9: patient2   91 NA 17
10: patient2  117 NA 18

As you can see, rows 4 and 5 are wrong. The value for Score in df1 is used twice (4). The correct output around those rows should be as follows, as each score (X or Y in this case) can only be used once:

          ID Days  X  Y
 4: patient1  309  4 14
 5: patient1  315 NA 15
 6: patient1  351  5 NA

Code for dataframes below.

> dput(df1)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1", 
"patient1", "patient2", "patient2"), Days = c("0", "116", "225", 
"309", "351", "0", "49"), Score = 1:7), row.names = c(NA, 7L), class = "data.frame")
> dput(df2)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1", 
"patient1", "patient2", "patient2", "patient2"), Days = c("0", 
"86", "195", "279", "315", "0", "91", "117"), Score = 11:18), row.names = c(NA, 
8L), class = "data.frame")


asked May 28 '20 14:05

BorisRu

2 Answers

This sounds like a data-cleaning exercise on a realistic but messy dataset, of the kind most of us have unfortunately dealt with before. Here is another data.table option:

DT1[, c("Xrn", "s1", "e1", "s2", "e2") := .(.I, Days - 30L, Days + 30L, Days, Days)]
DT2[, c("Yrn", "s1", "e1", "s2", "e2") := .(.I, Days, Days, Days - 30L, Days + 30L)]
byk <- c("ID", "s1", "e1")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o1 <- foverlaps(DT1, DT2)

byk <- c("ID", "s2", "e2")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o2 <- foverlaps(DT2, DT1)

olaps <- funion(o1, setcolorder(o2, names(o1)))[
    is.na(Days), Days := i.Days]

ans <- olaps[, {
        if (any(Days == i.Days)) {
            .SD[Days == i.Days, 
                .(Days=Days[1L], Xrn=Xrn[1L], Yrn=Yrn[1L], X=X[1L], Y=Y[1L])]
        } else {
            .SD[, .(Days=md, Xrn=Xrn[1L], Yrn=Yrn[1L], X=X[1L], Y=Y[1L])]
        }
    },
    keyby = .(ID, md = pmax(Days, i.Days))]

#or also ans[duplicated(Xrn), X := NA_integer_][duplicated(Yrn), Y := NA_integer_]
ans[rowid(Xrn) > 1L, X := NA_integer_]
ans[rowid(Yrn) > 1L, Y := NA_integer_]
ans[, c("md", "Xrn", "Yrn") := NULL][]

Output for the dataset below:

   ID Days  X  Y
1:  1    0  1 11
2:  1   10  2 12
3:  1   25  3 13
4:  1  248  4 14
5:  1  353  5 15
6:  2  100  6 16
7:  2  150 NA 17
8:  3  503 NA 18
9:  3  538  7 NA

Output for the second dataset in the OP's edit:

          ID Days  X  Y
 1: patient1    0  1 11
 2: patient1  116  2 12
 3: patient1  225  3 13
 4: patient1  309  4 14
 5: patient1  315 NA 15
 6: patient1  351  5 NA
 7: patient2    0  6 16
 8: patient2   49  7 NA
 9: patient2   91 NA 17
10: patient2  117 NA 18

Data (I have added more data from the other linked post and simplified it for easier viewing):

library(data.table)
DT1 <- data.table(ID = c(1,1,1,1,1,2,3),
    Days = c(0,10,25,235,353,100,538))[, X := .I]
DT2 <- data.table(ID = c(1,1,1,1,1,2,2,3),
    Days = c(0,10,25,248,353,100,150,503))[, Y := .I + 10L]

Explanation:

1. Perform two overlapping joins, using each table as the left table in turn.
2. Union the two results from before, then set NA days from the right table to those from the left table.
3. Group by patient and overlapping dates. If identical dates exist, keep those records; otherwise use the maximum date.
4. Each Score should only be used once, hence remove duplicates.

Please let me know if you find cases where this approach is not giving the correct results.
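
The last step (each Score used only once) can be illustrated in isolation with base R. This is a toy sketch, not the data.table code above: the three rows are typed in by hand to resemble the grouped result around Days 309 to 351 from the OP's edit, and `Xrn` stands for the row number of the first table that supplied `X`.

```r
# Hand-made stand-in for the grouped result before de-duplication.
ans <- data.frame(Days = c(309, 315, 351),
                  Xrn  = c(4L, 4L, 5L),   # source row of X in the first table
                  X    = c(4L, 4L, 5L),
                  Y    = c(14L, 15L, NA))

# A source row that already appeared further up has been used: blank its score.
ans$X[duplicated(ans$Xrn)] <- NA
ans
```

After this step `X` is 4, NA, 5, so score 4 is no longer reported twice. Rows whose `Xrn` is NA would also be flagged from the second one on, which is harmless because their `X` is already NA.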


answered Sep 19 '22 02:09

chinsoon12

A base R solution: for each patient, use lapply to find where the difference in Days is below the threshold and build an expand.grid of all possible combinations. Then remove the combinations that would pick the same row twice or pick a row that lies behind an earlier pick. For the remaining combinations, calculate the day differences and pick the combination with the consecutively lowest differences. Finally, rbind the unmatched rows from df2.

threshold <- 30
nmScore <- threshold
x <- do.call(rbind, lapply(unique(c(df1$ID, df2$ID)), function(ID) {
  x <- df1[df1$ID == ID,]
  y <- df2[df2$ID == ID,]
  if(nrow(x) == 0) {return(data.frame(ID=ID, y[1,-1][NA,], y[,-1]))}
  if(nrow(y) == 0) {return(data.frame(ID=ID, x[,-1], x[1,-1][NA,]))}
  x <- x[order(x$Days),]
  y <- y[order(y$Days),]
  z <- do.call(expand.grid, lapply(x$Days, function(z) c(NA,
         which(abs(z - y$Days) < threshold))))
  z <- z[!apply(z, 1, function(z) {anyDuplicated(z[!is.na(z)]) > 0 ||
         any(diff(z[!is.na(z)]) < 1)}), , drop = FALSE]
  s <- as.data.frame(sapply(seq_len(ncol(z)), function(j) {
         abs(x$Days[j] - y$Days[z[,j]])}))
  s[is.na(s)] <- nmScore
  s <- matrix(apply(s, 1, sort), nrow(s), byrow = TRUE)
  i <- rep(TRUE, nrow(s))
  for(j in seq_len(ncol(s))) {i[i]  <- s[i,j] == min(s[i,j])}
  i <- unlist(z[which.max(i),])
  j <- setdiff(seq_len(nrow(y)), i)
  rbind(data.frame(ID=ID, x[,-1], y[i, -1]),
  if(length(j) > 0) data.frame(ID=ID, x[1,-1][NA,], y[j, -1], row.names=NULL))
}))
x <- x[order(x[,1], ifelse(is.na(x[,2]), x[,4], x[,2])),]
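
The selection by "consecutively lowest difference" used above can be tried on a toy matrix. The numbers here are made up for illustration: each row of `s` holds the day differences of one candidate combination, with 30 standing in for "no match".

```r
# Three hypothetical combinations, one per row.
s <- rbind(c(2, 2, 2),     # everything matched, each 2 days apart
           c(1, 2, 30),    # one 1-day match, one 2-day match, one unmatched
           c(1, 30, 30))   # one 1-day match, two unmatched

# Sort each row so the smallest difference comes first.
s <- t(apply(s, 1, sort))

# Keep only the rows with the smallest first difference, then among those the
# smallest second difference, and so on (a lexicographic minimum).
i <- rep(TRUE, nrow(s))
for (j in seq_len(ncol(s))) i[i] <- s[i, j] == min(s[i, j])
best <- which.max(i)
best
# 2
```

Row 2 wins even though row 1 has the smaller total (6 versus 33), because this rule prefers the nearest single match over the smallest sum.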

Data:

Test cases (the leading digit of each ID identifies the case): 0: first test case from Boris Ruwe; 1: second test case from Boris Ruwe; 2: third test case from Boris Ruwe; 3: test case from Uwe; 4: test case from Boris Ruwe from "R rolling join two data.tables with error margin on join"; 5: test case from GKi.

df1 <- structure(list(ID = c("0patient1", "0patient1", "0patient1", 
"0patient1", "0patient2", "0patient3", "1patient1", "1patient1", 
"1patient1", "1patient1", "1patient1", "2patient1", "2patient1", 
"2patient1", "2patient1", "2patient1", "2patient2", "2patient2", 
"3patient1", "3patient1", "3patient1", "3patient1", "3patient1", 
"3patient1", "3patient2", "3patient3", "4patient1", "4patient1", 
"4patient1", "4patient1", "4patient2", "4patient3", "5patient1", 
"5patient1", "5patient1", "5patient2"), Days = c(0, 25, 235, 
353, 100, 538, 0, 5, 10, 15, 50, 0, 116, 225, 309, 351, 0, 49, 
0, 1, 25, 235, 237, 353, 100, 538, 0, 10, 25, 340, 100, 538, 
3, 6, 10, 1), Score = c(NA, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1, 
2, 3, 4, 5, 6, 7, NA, 2, 3, 4, 5, 6, 7, 8, NA, 2, 3, 99, 5, 6, 
1, 2, 3, 1)), row.names = c(NA, -36L), class = "data.frame")
df2 <- structure(list(ID = c("0patient1", "0patient1", "0patient1", 
"0patient1", "0patient2", "0patient2", "0patient3", "1patient1", 
"1patient1", "1patient1", "1patient1", "1patient1", "2patient1", 
"2patient1", "2patient1", "2patient1", "2patient1", "2patient2", 
"2patient2", "2patient2", "3patient1", "3patient1", "3patient1", 
"3patient1", "3patient1", "3patient1", "3patient2", "3patient2", 
"3patient3", "4patient1", "4patient1", "4patient1", "4patient1", 
"4patient2", "4patient2", "4patient3", "5patient1", "5patient1", 
"5patient1", "5patient3"), Days = c(0, 25, 248, 353, 100, 150, 
503, 0, 5, 12, 15, 50, 0, 86, 195, 279, 315, 0, 91, 117, 0, 25, 
233, 234, 248, 353, 100, 150, 503, 0, 10, 25, 353, 100, 150, 
503, 1, 4, 8, 1), Score = c(1, 10, 3, 4, 5, 7, 6, 1, 2, 3, 4, 
5, 11, 12, 13, 14, 15, 16, 17, 18, 11, 12, 13, 14, 15, 16, 17, 
18, 19, 1, 10, 3, 4, 5, 7, 6, 11, 12, 13, 1)), row.names = c(NA, 
-40L), class = "data.frame")
df1
#          ID Days Score
#1  0patient1    0    NA
#2  0patient1   25     2
#3  0patient1  235     3
#4  0patient1  353     4
#5  0patient2  100     5
#6  0patient3  538     6
#7  1patient1    0     1
#8  1patient1    5     2
#9  1patient1   10     3
#10 1patient1   15     4
#11 1patient1   50     5
#12 2patient1    0     1
#13 2patient1  116     2
#14 2patient1  225     3
#15 2patient1  309     4
#16 2patient1  351     5
#17 2patient2    0     6
#18 2patient2   49     7
#19 3patient1    0    NA
#20 3patient1    1     2
#21 3patient1   25     3
#22 3patient1  235     4
#23 3patient1  237     5
#24 3patient1  353     6
#25 3patient2  100     7
#26 3patient3  538     8
#27 4patient1    0    NA
#28 4patient1   10     2
#29 4patient1   25     3
#30 4patient1  340    99
#31 4patient2  100     5
#32 4patient3  538     6
#33 5patient1    3     1
#34 5patient1    6     2
#35 5patient1   10     3
#36 5patient2    1     1

df2
#          ID Days Score
#1  0patient1    0     1
#2  0patient1   25    10
#3  0patient1  248     3
#4  0patient1  353     4
#5  0patient2  100     5
#6  0patient2  150     7
#7  0patient3  503     6
#8  1patient1    0     1
#9  1patient1    5     2
#10 1patient1   12     3
#11 1patient1   15     4
#12 1patient1   50     5
#13 2patient1    0    11
#14 2patient1   86    12
#15 2patient1  195    13
#16 2patient1  279    14
#17 2patient1  315    15
#18 2patient2    0    16
#19 2patient2   91    17
#20 2patient2  117    18
#21 3patient1    0    11
#22 3patient1   25    12
#23 3patient1  233    13
#24 3patient1  234    14
#25 3patient1  248    15
#26 3patient1  353    16
#27 3patient2  100    17
#28 3patient2  150    18
#29 3patient3  503    19
#30 4patient1    0     1
#31 4patient1   10    10
#32 4patient1   25     3
#33 4patient1  353     4
#34 4patient2  100     5
#35 4patient2  150     7
#36 4patient3  503     6
#37 5patient1    1    11
#38 5patient1    4    12
#39 5patient1    8    13
#40 5patient3    1     1

Result:

#           ID Days Score Days.1 Score.1
#1   0patient1    0    NA      0       1
#2   0patient1   25     2     25      10
#3   0patient1  235     3    248       3
#4   0patient1  353     4    353       4
#5   0patient2  100     5    100       5
#110 0patient2   NA    NA    150       7
#111 0patient3   NA    NA    503       6
#6   0patient3  538     6     NA      NA
#7   1patient1    0     1      0       1
#8   1patient1    5     2      5       2
#9   1patient1   10     3     12       3
#10  1patient1   15     4     15       4
#11  1patient1   50     5     50       5
#12  2patient1    0     1      0      11
#112 2patient1   NA    NA     86      12
#13  2patient1  116     2     NA      NA
#210 2patient1   NA    NA    195      13
#14  2patient1  225     3     NA      NA
#37  2patient1   NA    NA    279      14
#15  2patient1  309     4    315      15
#16  2patient1  351     5     NA      NA
#17  2patient2    0     6      0      16
#18  2patient2   49     7     NA      NA
#113 2patient2   NA    NA     91      17
#211 2patient2   NA    NA    117      18
#19  3patient1    0    NA      0      11
#20  3patient1    1     2     NA      NA
#21  3patient1   25     3     25      12
#114 3patient1   NA    NA    233      13
#22  3patient1  235     4    234      14
#23  3patient1  237     5    248      15
#24  3patient1  353     6    353      16
#25  3patient2  100     7    100      17
#115 3patient2   NA    NA    150      18
#116 3patient3   NA    NA    503      19
#26  3patient3  538     8     NA      NA
#27  4patient1    0    NA      0       1
#28  4patient1   10     2     10      10
#29  4patient1   25     3     25       3
#30  4patient1  340    99    353       4
#31  4patient2  100     5    100       5
#117 4patient2   NA    NA    150       7
#118 4patient3   NA    NA    503       6
#32  4patient3  538     6     NA      NA
#119 5patient1   NA    NA      1      11
#33  5patient1    3     1      4      12
#34  5patient1    6     2      8      13
#35  5patient1   10     3     NA      NA
#36  5patient2    1     1     NA      NA
#NA  5patient3   NA    NA      1       1

Formatted result:

data.frame(ID=x[,1], Days=ifelse(is.na(x[,2]), x[,4], x[,2]),
 Score.x=x[,3], Score.y=x[,5])
#          ID Days Score.x Score.y
#1  0patient1    0      NA       1
#2  0patient1   25       2      10
#3  0patient1  235       3       3
#4  0patient1  353       4       4
#5  0patient2  100       5       5
#6  0patient2  150      NA       7
#7  0patient3  503      NA       6
#8  0patient3  538       6      NA
#9  1patient1    0       1       1
#10 1patient1    5       2       2
#11 1patient1   10       3       3
#12 1patient1   15       4       4
#13 1patient1   50       5       5
#14 2patient1    0       1      11
#15 2patient1   86      NA      12
#16 2patient1  116       2      NA
#17 2patient1  195      NA      13
#18 2patient1  225       3      NA
#19 2patient1  279      NA      14
#20 2patient1  309       4      15
#21 2patient1  351       5      NA
#22 2patient2    0       6      16
#23 2patient2   49       7      NA
#24 2patient2   91      NA      17
#25 2patient2  117      NA      18
#26 3patient1    0      NA      11
#27 3patient1    1       2      NA
#28 3patient1   25       3      12
#29 3patient1  233      NA      13
#30 3patient1  235       4      14
#31 3patient1  237       5      15
#32 3patient1  353       6      16
#33 3patient2  100       7      17
#34 3patient2  150      NA      18
#35 3patient3  503      NA      19
#36 3patient3  538       8      NA
#37 4patient1    0      NA       1
#38 4patient1   10       2      10
#39 4patient1   25       3       3
#40 4patient1  340      99       4
#41 4patient2  100       5       5
#42 4patient2  150      NA       7
#43 4patient3  503      NA       6
#44 4patient3  538       6      NA
#45 5patient1    1      NA      11
#46 5patient1    3       1      12
#47 5patient1    6       2      13
#48 5patient1   10       3      NA
#49 5patient2    1       1      NA
#50 5patient3    1      NA       1

Alternatives to get Days:

#From df1 and in case it is NA I took it from df2
data.frame(ID=x[,1], Days=ifelse(is.na(x[,2]), x[,4], x[,2]),
 Score.x=x[,3], Score.y=x[,5])

#From df2 and in case it is NA I took it from df1
data.frame(ID=x[,1], Days=ifelse(is.na(x[,4]), x[,2], x[,4]),
 Score.x=x[,3], Score.y=x[,5])

#Mean
data.frame(ID=x[,1], Days=rowMeans(x[,c(2,4)], na.rm=TRUE),
 Score.x=x[,3], Score.y=x[,5])

If the total difference in days should be minimized instead, which allows a match that is not the nearest one, a possible way is:

threshold <- 30
nmScore <- threshold
x <- do.call(rbind, lapply(unique(c(df1$ID, df2$ID)), function(ID) {
  x <- df1[df1$ID == ID,]
  y <- df2[df2$ID == ID,]
  x <- x[order(x$Days),]
  y <- y[order(y$Days),]
  if(nrow(x) == 0) {return(data.frame(ID=ID, y[1,-1][NA,], y[,-1]))}
  if(nrow(y) == 0) {return(data.frame(ID=ID, x[,-1], x[1,-1][NA,]))}
  z <- do.call(expand.grid, lapply(x$Days, function(z) c(NA,
         which(abs(z - y$Days) < threshold))))
  z <- z[!apply(z, 1, function(z) {anyDuplicated(z[!is.na(z)]) > 0 ||
         any(diff(z[!is.na(z)]) < 1)}), , drop = FALSE]
  s <- as.data.frame(sapply(seq_len(ncol(z)), function(j) {
         abs(x$Days[j] - y$Days[z[,j]])}))
  s[is.na(s)] <- nmScore
  i <- unlist(z[which.min(rowSums(s)),])
  j <- setdiff(seq_len(nrow(y)), i)
  rbind(data.frame(ID=ID, x[,-1], y[i, -1]),
  if(length(j) > 0) data.frame(ID=ID, x[1,-1][NA,], y[j, -1], row.names=NULL))
}))
x <- x[order(x[,1], ifelse(is.na(x[,2]), x[,4], x[,2])),]
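
As a small check of what minimizing the total changes, the following sketch repeats the enumeration for test case 5 (patient 5patient1) only. It is a trimmed-down illustration of the approach above; the variable names `x_days`, `y_days`, `keep`, `cost` and `best` are introduced here.

```r
x_days <- c(3, 6, 10)   # Days of 5patient1 in df1
y_days <- c(1, 4, 8)    # Days of 5patient1 in df2
threshold <- 30

# One column per x row: NA (no match) or the index of a y row within the threshold.
z <- do.call(expand.grid,
             lapply(x_days, function(d) c(NA, which(abs(d - y_days) < threshold))))

# Drop combinations that reuse a y row or do not pick y rows in increasing order.
keep <- apply(z, 1, function(v) {
  v <- v[!is.na(v)]
  anyDuplicated(v) == 0 && all(diff(v) >= 1)
})
z <- z[keep, , drop = FALSE]

# Day difference per pick; an unmatched row costs the threshold.
cost <- sapply(seq_len(ncol(z)), function(j) abs(x_days[j] - y_days[z[[j]]]))
cost[is.na(cost)] <- threshold

# The combination with the smallest total difference.
best <- unlist(z[which.min(rowSums(cost)), ], use.names = FALSE)
best
# 1 2 3
```

Here every row is matched (3 with 1, 6 with 4, 10 with 8) for a total of 6 days, whereas the result table further up, produced by the first variant, pairs 3 with 4 and 6 with 8 and leaves 10 unmatched.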

answered Sep 19 '22 02:09

GKi
