Subsetting Data Frame Based on Contents of a "Column" List

Question

Set-Up

I have a list matrix, where one of the "columns" is a list (I realize it's an odd dataset to work with, but I find it useful for other operations). Each entry of the list is either (1) empty (integer(0)), (2) a single integer, or (3) a vector of integers.

E.g. the R object "d.f", with d.f$ID an index vector and d.f$Basket_List the list.

ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List

My Question

Issue 1

I'd like to create a new dataset that's a subset of the initial one, based on whether or not "Basket_List" contains certain value(s). E.g. a subset of all the rows in d.f such that Basket_List contains "123", or both "123" & "987", or meets other more complicated conditions.

I've tried every variation of the following, but to no avail.

d.f2 <- subset(d.f, 123 %in% Basket_List)
d.f2 <- subset(d.f, 123 == any(Basket_List))
d.f2 <- d.f[which(123 %in% d.f$Basket_List),]
# should return the subset, with rows 2,3,5,7 & 8

Issue 2

My other issue is that I'll be running this operation over many millions of rows (it's transaction data), so I'd like to optimize it as much as possible for speed (I have a complicated for loop now, but it takes too much time).

Alternative Set-Up of Data

If you think it might be useful, the data could also be set up as follows:

ID <- c(1,2,2,3,3,4,5,5,6,7,7,8,8,9)
Basket <- c(NA,123,987,123,123,456,456,123,456,123,987,987,123,987)
alt.d.f <- data.frame(ID,Basket)

Ari B. Friedman · Accepted Answer

You can use sapply for this:

ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)

sel <- sapply( Basket_List, function(bl,searchItem) {
  any(searchItem %in% bl)
}, searchItem=c(123) )

> sel
[1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE

> d.f[sel,,drop=FALSE]
  ID
2  2
3  3
5  5
7  7
8  8
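
The question also asks about rows whose basket contains both "123" and "987". A minimal sketch of that case, assuming the same d.f as in the question: swapping any() for all() requires every search item to be present. The helper name has_all is made up for illustration, and vapply is used in place of sapply only to guarantee a logical result.

```r
ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List

# has_all is a hypothetical helper: TRUE only when every
# element of searchItems occurs somewhere in the basket bl
has_all <- function(bl, searchItems) all(searchItems %in% bl)

# vapply returns exactly one logical value per basket
sel_both <- vapply(d.f$Basket_List, has_all, logical(1),
                   searchItems = c(123, 987))

# keep only the rows whose basket holds both 123 and 987
d.f[sel_both, , drop = FALSE]
```

With the sample data this keeps the rows with ID 2, 7 and 8. More complicated conditions can be built the same way by changing the body of the helper.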

Please be careful with your terminology. A data.frame is not a matrix. It's a type of list.
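
To illustrate the terminology point, here is a small check using only base R. The three-row data frame is a cut-down stand-in for the one in the question.

```r
# small stand-in for the data frame in the question
d.f <- data.frame(ID = c(1, 2, 3))
d.f$Basket_List <- list(integer(0), c(123, 987), 456)

is.data.frame(d.f)        # TRUE
is.list(d.f)              # TRUE: a data.frame is a list of columns
is.matrix(d.f)            # FALSE
is.list(d.f$Basket_List)  # TRUE: this column is itself a list
```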

Speed-wise, sapply is not the fastest option, but the selection itself (d.f[sel, ]) will be very fast since it is vectorized. If you need more speed, it's data.table time.
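
If the per-row function calls in sapply turn out to be the bottleneck, one possible alternative (a base R sketch, not the data.table approach mentioned above) is to flatten the list once and do a single vectorized comparison. This assumes R 3.2.0 or later for lengths(). The names items, row_of_item and sel_flat are made up for illustration, and whether it is actually faster on your data is worth benchmarking.

```r
ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)

# flatten all baskets into one vector of items
items <- unlist(Basket_List)

# for each flattened item, the index of the basket it came from
row_of_item <- rep(seq_along(Basket_List), lengths(Basket_List))

# a row is selected if any of its items equals 123
sel_flat <- seq_along(Basket_List) %in% row_of_item[items == 123]

d.f[sel_flat, , drop = FALSE]
```

The items and row_of_item vectors together are essentially the long layout from the question's alternative set-up (minus the NA row for the empty basket), so the same comparison could be run on alt.d.f directly.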


Tags: object, list, dataframe, r, subset


EconomiCurtis
