
Searching rows in a data frame in R

Tags:

r

I have strings of numbers not necessarily of the same length e.g.

0,0,1,2,1,0,0,0

1,1,0,1

2,1,2,0,1,0

I have imported these into a dataframe in R e.g. the above three strings would give the following three rows (which I shall call df):

[Image: the data frame df]
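The image itself is not reproduced here. As a rough sketch, a data frame of this kind could be built as follows, assuming the shorter strings are padded on the right with NA (an assumption, though it is consistent with how the answers below treat NA). The names `strings`, `rows` and `padded` are illustrative only; the answers refer to the same object as `dat`.

```r
# Hypothetical input: the three comma-separated strings from the question
strings <- c("0,0,1,2,1,0,0,0", "1,1,0,1", "2,1,2,0,1,0")

# Split each string on commas and convert the pieces to numeric vectors
rows <- lapply(strsplit(strings, ",", fixed = TRUE), as.numeric)

# Pad every row with NA up to the length of the longest one
# (NA padding is an assumption; the original image is not available)
n <- max(lengths(rows))
padded <- lapply(rows, function(r) c(r, rep(NA_real_, n - length(r))))

# Bind the padded rows into a data frame with one row per string
df <- as.data.frame(do.call(rbind, padded))
print(df)
```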

I am looking to write some functions that will help me understand the data. As a starting point - given a numeric vector x - I would like a 'process' P of establishing the number of rows which contain x as a subvector e.g. if x = c(2,1) then P(x) = 2, if x = c(0,0,0) then P(x) = 1 and if x = c(1,3) then P(x) = 0. I have many more similar questions though I am hoping I will be able to take the logic from this question and work out some of the other stuff myself.

user1873334 asked Dec 19 '12 11:12



2 Answers

Edit: The regex way would be:

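match.regex <- function(x,data){
  # collapse the pattern into one string, e.g. c(1,2) becomes "1_2"
  xs <- paste(x,collapse="_")
  # collapse every row the same way (NA becomes the text "NA")
  dats <- apply(data,1,paste,collapse="_")
  # count rows whose string contains the pattern; this is plain substring
  # matching, so "1_2" would also match inside "11_2"
  sum(grepl(xs,dats))
}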


> match.regex(c(1),dat)
[1] 3
> match.regex(c(0,0,0),dat)
[1] 1
> match.regex(c(1,2),dat)
[1] 2
> match.regex(5,dat)
[1] 0
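One caveat with pasting values together is that the pattern can match inside a longer number once the data contain values with more than one digit. The following is a minimal sketch of one way around that, not part of the original answer: the function name `match.delim` and the data frame `wide` are hypothetical, and `dat` is rebuilt here under the assumption that shorter rows are padded with NA.

```r
# Example data padded with NA, as assumed for `dat` in this answer
dat <- as.data.frame(rbind(
  c(0, 0, 1, 2, 1, 0, 0, 0),
  c(1, 1, 0, 1, NA, NA, NA, NA),
  c(2, 1, 2, 0, 1, 0, NA, NA)
))

# Hypothetical variant of match.regex: wrap both the pattern and each row in
# separators so that c(1, 2) cannot match inside values such as 11 or 21
match.delim <- function(x, data) {
  xs <- paste0("_", paste(x, collapse = "_"), "_")
  dats <- paste0("_", apply(data, 1, paste, collapse = "_"), "_")
  # fixed = TRUE treats the pattern as literal text (e.g. "." in decimals)
  sum(grepl(xs, dats, fixed = TRUE))
}

match.delim(c(1, 2), dat)
match.delim(c(0, 0, 0), dat)

# A single row holding 11, 2, 0: the undelimited pattern "1_2" would match it
wide <- data.frame(a = 11, b = 2, c = 0)
match.delim(c(1, 2), wide)
```

With `fixed = TRUE` this is literal substring matching rather than a regular expression, so any timings reported for `match.regex` do not necessarily carry over.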

Surprisingly, match.regex is faster than the other methods given here, and roughly twice as fast as my solution below (2.4 times in the benchmark shown), on both small and large datasets. Regular expressions are apparently well optimized:

> library(rbenchmark)
> benchmark(matching(c(1,2),dat),match.regex(c(1,2),dat),replications=1000)
                       test replications elapsed relative 
2 match.regex(c(1, 2), dat)         1000    0.15      1.0 
1    matching(c(1, 2), dat)         1000    0.36      2.4 

An approach that gives you the number directly and is more vectorized is the following:

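matching.row <- function(x,row){
    nx <- length(x)
    # candidate start positions: where the first element of x occurs in the row
    sid <- which(x[1]==row)
    # check the window of length nx at each start; windows that run past the
    # end of the row or hit an NA never yield TRUE
    any(sapply(sid,function(i) all(row[seq(i,i+nx-1)]==x)))
}

# count the rows that contain x; na.rm=TRUE drops rows whose result is NA
matching <- function(x,data)
  sum(apply(data,1,function(i) matching.row(x,i)),na.rm=TRUE)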

For each row, the positions where the first element of x occurs are taken as candidate starts. A window of the same length as x is then compared against x at each candidate start. This is done for every row, and the number of rows returning TRUE is what you want.

> matching(c(1),dat)
[1] 3
> matching(c(0,0,0),dat)
[1] 1
> matching(c(1,2),dat)
[1] 2
> matching(5,dat)
[1] 0
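The sliding-window idea can also be made explicit by building a matrix with one window per row. The sketch below is an illustrative alternative, not the code benchmarked above: `has.window` and `count.rows` are hypothetical names, `embed()` comes from the stats package that ships with R, and `dat` is rebuilt under the assumption that shorter rows are padded with NA.

```r
# Example data padded with NA, as assumed for `dat` in this answer
dat <- as.data.frame(rbind(
  c(0, 0, 1, 2, 1, 0, 0, 0),
  c(1, 1, 0, 1, NA, NA, NA, NA),
  c(2, 1, 2, 0, 1, 0, NA, NA)
))

# Hypothetical helper: TRUE if x occurs as a contiguous run inside row
has.window <- function(x, row) {
  row <- as.numeric(row)          # drop names carried over from the data frame
  nx <- length(x)
  if (nx > length(row)) return(FALSE)
  # embed() returns each window in reverse order, so flip the columns
  windows <- embed(row, nx)[, nx:1, drop = FALSE]
  # compare every window with x; a window containing NA never counts as a match
  hits <- apply(windows, 1, function(w) isTRUE(all(w == x)))
  any(hits)
}

# Count the rows of data that contain x
count.rows <- function(x, data)
  sum(apply(as.matrix(data), 1, function(r) has.window(x, r)))

count.rows(c(1, 2), dat)
count.rows(c(0, 0, 0), dat)
count.rows(c(1, 3), dat)
```

Unlike `matching.row`, this version compares every window rather than only those starting where `x[1]` occurs, which is simpler to read but does more work per row.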
Joris Meys answered Oct 11 '22 11:10


You need to apply a function to the rows of your data:

apply(dat, MARGIN = 1, FUN = is.sub.array, x = c(2,1))

where dat is your data.frame and is.sub.array is a function that checks whether x is contained in a larger vector (in practice, a row of your data.frame).

I am not aware of an existing is.sub.array function, so here is how I would write it:

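is.sub.array <- function(x, y) {
    # j[k] is TRUE while the elements of x seen so far match y ending at position k
    j <- rep(TRUE, length(y))
    for (i in seq_along(x)) {
        # shift the partial matches one position to the right
        if (i > 1) j <- c(FALSE, head(j, -1))
        # keep only positions where y equals the i-th element of x
        j <- j & vapply(y, FUN = function(a,b) isTRUE(all.equal(a, b)),
                        FUN.VALUE = logical(1), b = x[i])
    }
    # x is contained in y if at least one full match survives
    return(sum(j, na.rm = TRUE) > 0L)
}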

(The advantage of using all.equal is that it compares numeric values up to a small tolerance, something that regular expressions won't be able to do.)

Here are a few examples:

apply(dat, 1, is.sub.array, x = c(1, 2))
# [1]  TRUE FALSE  TRUE
apply(dat, 1, is.sub.array, x = c(0, 0, 0))
# [1]  TRUE FALSE FALSE
apply(dat, 1, is.sub.array, x = as.numeric(c(NA, NA)))
# [1] FALSE  TRUE  TRUE
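Since the question asks for the number of matching rows rather than a logical vector, the result of `apply` can simply be summed. A minimal sketch, assuming `dat` is padded with NA as in the examples above; the wrapper name `P` is borrowed from the question's notation, and `is.sub.array` is repeated so the block runs on its own.

```r
# Example data padded with NA, as assumed for `dat` in this answer
dat <- as.data.frame(rbind(
  c(0, 0, 1, 2, 1, 0, 0, 0),
  c(1, 1, 0, 1, NA, NA, NA, NA),
  c(2, 1, 2, 0, 1, 0, NA, NA)
))

# is.sub.array as defined above, repeated here for self-containment
is.sub.array <- function(x, y) {
    j <- rep(TRUE, length(y))
    for (i in seq_along(x)) {
        if (i > 1) j <- c(FALSE, head(j, -1))
        j <- j & vapply(y, FUN = function(a, b) isTRUE(all.equal(a, b)),
                        FUN.VALUE = logical(1), b = x[i])
    }
    return(sum(j, na.rm = TRUE) > 0L)
}

# Hypothetical wrapper: number of rows of data that contain x as a subvector
P <- function(x, data) sum(apply(data, 1, is.sub.array, x = x))

P(c(2, 1), dat)
P(c(0, 0, 0), dat)
P(c(1, 3), dat)
```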

Note: all.equal is sensitive to your data type, so be careful to use an x with the same type as your data (integer or numeric).

flodel answered Oct 11 '22 10:10