I'm trying to solve a data management problem in R.
Suppose my data looks as follows:
id <- c("123", "414", "606")
next.up <- c("414", "606", "119")
is.cond.met <- as.factor(c("FALSE", "FALSE", "TRUE"))
df <- data.frame(id, next.up, is.cond.met)
> df
id next.up is.cond.met
1 123 414 FALSE
2 414 606 FALSE
3 606 119 TRUE
And I'd like to obtain is the following:
id <- c("123", "414", "606")
next.up <- c("414", "606", "119")
is.cond.met <- as.factor(c("FALSE", "FALSE", "TRUE"))
origin <- c("606", "606", "119")
df.result <- data.frame(id, next.up, is.cond.met, origin)
> df.result
id next.up is.cond.met origin
1 123 414 FALSE 606
2 414 606 FALSE 606
3 606 119 TRUE 119
In other words: I want to match each ID to its "origin" when a given condition (is.met) is true. The difficulty I'm having is that this is iterative and hierarchical: to find the origin I may have to go through multiple degrees of separations. the logical steps are illustrated below. I'm really not sure how to tackle this in R.
UPDATE
One of the comments propose a data.frame solution which works for sorted data, as in the minimal example above. In truth, my data is not sorted in such a manner. A better example is as follows:
id <- c("961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268")
next.up <- c("20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112")
is.cond.met <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
df <- data.frame(id, next.up, is.cond.met, stringsAsFactors = FALSE)
glimpse(df)
Observations: 8
Variables: 3
$ id <chr> "961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268"
$ next.up <chr> "20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112"
$ is.cond.met <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
> df
id next.up is.cond.met
1 961980 20090 TRUE
2 14788 655036 FALSE
3 902460 40375164 FALSE
4 900748 40031850 FALSE
5 728912 40368996 FALSE
6 141726 961980 FALSE
7 1041190 141726 FALSE
8 692268 760112 FALSE
UPDATE 2: the end result should look like this:
> df.end.result
id next.up is.cond.met origin
1 961980 20090 TRUE <NA>
2 14788 655036 FALSE <NA>
3 902460 40375164 FALSE <NA>
4 900748 40031850 FALSE <NA>
5 728912 40368996 FALSE <NA>
6 141726 961980 FALSE 961980
7 1041190 141726 FALSE 961980
8 692268 760112 FALSE <NA>
I have extended your example data a bit to show what happens with more TRUE
values in is.cond.met
. Using the data.table
package, you could do:
library(data.table)
setDT(df)[, grp := shift(cumsum(is.cond.met), fill=0)
][, origin := ifelse(is.cond.met, next.up, id[.N]), by = grp][]
which gives:
> df
id next.up is.cond.met grp origin
1: 123 414 FALSE 0 606
2: 414 606 FALSE 0 606
3: 606 119 TRUE 0 119
4: 119 321 FALSE 1 321
5: 321 507 TRUE 1 507
6: 507 185 TRUE 2 185
Explanation:
shift(cumsum(is.cond.met), fill=0)
.ifelse(is.cond.met, next.up, id[.N])
you assign the correct values to origin
.Note: The id
and next.up
columns should be of class character for the above to work (for that reason I used stringsAsFactors = FALSE
in constructing the extended example data). If they are factors, convert them first with as.character
. If is.cond.met
isn't already a logical, convert it with as.logical
.
On the updated example data, the above code gives:
id next.up is.cond.met grp origin
1: 961980 20090 TRUE 0 20090
2: 14788 655036 FALSE 1 692268
3: 902460 40375164 FALSE 1 692268
4: 900748 40031850 FALSE 1 692268
5: 728912 40368996 FALSE 1 692268
6: 141726 961980 FALSE 1 692268
7: 1041190 141726 FALSE 1 692268
8: 692268 760112 FALSE 1 692268
Used data:
id <- c("123", "414", "606", "119", "321", "507")
next.up <- c("414", "606", "119", "321", "507", "185")
is.cond.met <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
df <- data.frame(id, next.up, is.cond.met, stringsAsFactors = FALSE)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With