Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.

  • Variable
  • FO.variable

Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX

Please let me know if there's a better way of sharing this data.

All I want to do is create an additional binary keep column which should be:

  • 1 if FO.variable is inside Variable
  • 0 if FO.Variable is not inside Variable

Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this and R and failing.

Here's what I've tried:

  1. Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.

  2. My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer but I figured it would get me close enough for me to figure it out from there...the code ran for a while than errored because invalid subscript.

    transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])

  3. My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?

    kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))

  4. EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:

for(i in 1:nrow(dd)){ if(dd[i,4] %in% dd[i,2]) dd$test[i] <- 1 }

As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".

A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?

This turned out to not nearly be as simple as I would of thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.

like image 234
leaRningR909 Avatar asked Feb 10 '23 11:02

leaRningR909


1 Answers

I read the data

df = dget("http://dpaste.com/2SJ6DPX.txt")

then split the 'variable' column into its parts and figured out the lengths of each entry

v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v)    ## sapply(v, length) in R-3.1.3

Then I unlisted v and created an index that maps the unlisted v to the row from which it came from

uv = unlist(v)
idx = rep(seq_along(v), len)

Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable

test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE

Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))

f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)

    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}

(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)

f1 = function(dd) 
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)

f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable), 
           as.character(dd$variable), fixed=TRUE)

f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))

with

> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100

Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.

like image 118
Martin Morgan Avatar answered Feb 13 '23 02:02

Martin Morgan