Searching rows in a data frame in R

Tags:

r

I have strings of numbers, not necessarily of the same length, e.g.

0,0,1,2,1,0,0,0

1,1,0,1

2,1,2,0,1,0

I have imported these into a data frame in R; e.g. the above three strings give the following three rows (which I shall call df):

[Image: the data frame df, one row per string, with shorter rows padded with NA]
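A minimal sketch of how such a data frame could be built from the three strings. It assumes that shorter rows are padded with NA (which is consistent with the NA example in the answers below) and uses the name dat that the answers use; the intermediate variable names are illustrative only.

```r
# The three example strings from the question
strings <- c("0,0,1,2,1,0,0,0", "1,1,0,1", "2,1,2,0,1,0")

# Split on commas and convert each piece to a numeric vector
rows <- lapply(strsplit(strings, ","), as.numeric)

# Assumption: pad every row with NA up to the length of the longest row
width <- max(lengths(rows))
padded <- lapply(rows, function(r) c(r, rep(NA_real_, width - length(r))))

# One row per string; columns get default names V1, V2, ...
dat <- as.data.frame(do.call(rbind, padded))
print(dat)
```

Reading the strings from a file with read.csv(..., header = FALSE, fill = TRUE) would probably give a similar layout, but that depends on how the file is structured.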

I am looking to write some functions that will help me understand the data. As a starting point, given a numeric vector x, I would like a 'process' P for establishing the number of rows which contain x as a subvector, e.g. if x = c(2,1) then P(x) = 2, if x = c(0,0,0) then P(x) = 1, and if x = c(1,3) then P(x) = 0. I have many more similar questions, though I am hoping I will be able to take the logic from this question and work out some of the other stuff myself.

asked Dec 19 '12 11:12

user1873334

2 Answers

Edit: The regex way would be:

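match.regex <- function(x,data){
  # Delimit both ends of every value so that e.g. 1 cannot match inside 11
  wrap <- function(v) paste0("_",paste(v,collapse="_"),"_")
  xs <- wrap(x)
  dats <- apply(data,1,wrap)
  sum(grepl(xs,dats))
}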


> match.regex(c(1),dat)
[1] 3
> match.regex(c(0,0,0),dat)
[1] 1
> match.regex(c(1,2),dat)
[1] 2
> match.regex(5,dat)
[1] 0

Surprisingly, this one is faster than the other methods given here, and about twice as fast as my solution below, on both small and big datasets. Regexes are apparently pretty well optimized (benchmark() here is presumably the one from the rbenchmark package):

> benchmark(matching(c(1,2),dat),match.regex(c(1,2),dat),replications=1000)
                       test replications elapsed relative 
2 match.regex(c(1, 2), dat)         1000    0.15      1.0 
1    matching(c(1, 2), dat)         1000    0.36      2.4
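One caveat with matching on pasted strings is worth spelling out: the delimiter has to surround every value, including the first and the last, otherwise a multi-digit value can match partially. The sketch below illustrates this on a made-up row (the data in the question only contains single digits, so it is not affected); wrap is a hypothetical helper name.

```r
# Made-up row containing a two-digit value; it does NOT contain 1 followed by 2
row <- c(11, 2, 0)
x <- c(1, 2)

# Naive pasting: "1_2" is found inside "11_2_0", a false positive
naive <- grepl(paste(x, collapse = "_"), paste(row, collapse = "_"))

# Delimiters at both ends of pattern and row: "_1_2_" is not in "_11_2_0_"
wrap <- function(v) paste0("_", paste(v, collapse = "_"), "_")
delimited <- grepl(wrap(x), wrap(row))

print(naive)      # TRUE
print(delimited)  # FALSE
```

For non-integer data, passing fixed = TRUE to grepl would additionally stop "." from being treated as a regex wildcard.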

A more vectorized approach that gives you the number directly is the following:

matching.row <- function(x,row){
    nx <- length(x)
    sid <- which(x[1]==row)
    any(sapply(sid,function(i) all(row[seq(i,i+nx-1)]==x)))
}

matching <- function(x,data)
  sum(apply(data,1,function(i) matching.row(x,i)),na.rm=TRUE)

Here you first find the positions in a row where the first element of x occurs. For each of these positions, a window of the same length as x is then checked against x. Windows that run past the end of the row or into NA values can yield NA, which is why the sum uses na.rm=TRUE. This is done for every row, and the number of rows returning TRUE is what you want.
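As a variation on the window idea, every window of length(x) can be laid out as a row of a matrix with embed() and compared against x. This is only a sketch under the assumption that dat is NA-padded as in the question; contains.window and count.window are hypothetical names, not part of the answer above.

```r
# Example data as in the question, with shorter rows padded with NA
dat <- data.frame(rbind(c(0, 0, 1, 2, 1, 0, 0, 0),
                        c(1, 1, 0, 1, NA, NA, NA, NA),
                        c(2, 1, 2, 0, 1, 0, NA, NA)))

# Does `row` contain `x` as a run of consecutive elements?
contains.window <- function(x, row) {
  row <- as.numeric(row)            # drop names added by apply()
  nx <- length(x)
  if (nx < 1 || nx > length(row)) return(FALSE)
  # embed() returns each window in reverse order, so flip the columns back
  windows <- embed(row, nx)[, nx:1, drop = FALSE]
  # isTRUE() turns comparisons involving NA padding into FALSE
  any(apply(windows, 1, function(w) isTRUE(all(w == x))))
}

# Number of rows of `data` that contain `x`
count.window <- function(x, data) {
  sum(apply(data, 1, function(r) contains.window(x, r)))
}

print(count.window(c(2, 1), dat))     # 2
print(count.window(c(0, 0, 0), dat))  # 1
print(count.window(c(1, 3), dat))     # 0
```

Because == is used for the comparison, this version does not match NA values in x and compares numbers exactly.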

> matching(c(1),dat)
[1] 3
> matching(c(0,0,0),dat)
[1] 1
> matching(c(1,2),dat)
[1] 2
> matching(5,dat)
[1] 0

answered Oct 11 '22 11:10

Joris Meys

You need to apply a function to the rows of your data:

apply(dat, MARGIN = 1, FUN = is.sub.array, x = c(2,1))

where dat is your data.frame and is.sub.array is a function that checks whether x is contained in a larger vector (in practice, a row of your data.frame). The result is one logical value per row, so wrapping the call in sum() gives the count you are after.

I am not aware of an existing is.sub.array function, so here is how I would write it:

is.sub.array <- function(x, y) {
    j <- rep(TRUE, length(y))
    for (i in seq_along(x)) {
        if (i > 1) j <- c(FALSE, head(j, -1))
        j <- j & vapply(y, FUN = function(a,b) isTRUE(all.equal(a, b)),
                        FUN.VALUE = logical(1), b = x[i])
    }
    return(sum(j, na.rm = TRUE) > 0L)
}

(The advantage of using all.equal is that it compares numeric values up to a small tolerance, so it also works for non-integer numeric vectors, something that regular expressions on pasted strings cannot do reliably.)

Here are a few examples:

apply(dat, 1, is.sub.array, x = c(1, 2))
# [1]  TRUE FALSE  TRUE
apply(dat, 1, is.sub.array, x = c(0, 0, 0))
# [1]  TRUE FALSE FALSE
apply(dat, 1, is.sub.array, x = as.numeric(c(NA, NA)))
# [1] FALSE  TRUE  TRUE

Note: all.equal is sensitive to your data type, so be careful to use an x with the same type as your data (integer or numeric).
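A short sketch of the two points about all.equal made above: the tolerance when comparing floating-point values, and the sensitivity to type, illustrated here with a logical NA versus a numeric NA (which is why the NA example uses as.numeric). The variable name a is illustrative.

```r
a <- 0.1 + 0.2

# Exact comparison fails because of floating-point rounding
print(a == 0.3)                              # FALSE

# all.equal compares up to a small tolerance
print(isTRUE(all.equal(a, 0.3)))             # TRUE

# A plain NA is logical, so it does not compare equal to a numeric NA
print(isTRUE(all.equal(NA_real_, NA)))       # FALSE
print(isTRUE(all.equal(NA_real_, NA_real_))) # TRUE
```

isTRUE() is needed because all.equal returns a character description of the difference, not FALSE, when the values differ.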

answered Oct 11 '22 10:10

flodel
