
Using grep to help subset a data frame

I am having trouble subsetting my data. I want to subset the data frame on column x, keeping only the rows where the first three characters are G45.

My data frame:

x <- c("G448", "G459", "G479", "G406")
y <- c(1:4)
My.Data <- data.frame(x, y)

I have tried:

subset (My.Data, x=="G45*") 

But I am unsure how to use wildcards. I have also tried grep() to find the indices:

grep("G45*", My.Data$x)

but it returns all 4 rows rather than just those beginning with G45, probably because I am also unsure how wildcards work there.

Stewart Wiseman asked Jan 23 '14 14:01




1 Answer

It's pretty straightforward using [ to extract:

grep returns the positions at which your search pattern matched (or the matching values themselves if you set value = TRUE).

grep("^G45", My.Data$x)
# [1] 2
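As for why the original pattern returned all four rows: grep treats the pattern as a regular expression, not a shell-style wildcard. In a regex, * means "zero or more of the preceding character", so "G45*" matches "G4" followed by any number of 5s, including none. The sketch below illustrates this on a plain character vector; the variable names (all_hits, prefix_hits, rx, glob_hits) are made up for the example, and glob2rx() from the utils package is shown only as one option if you prefer writing wildcard-style patterns.

```r
x <- c("G448", "G459", "G479", "G406")

# "G45*" as a regex: "G4" followed by zero or more "5"s,
# so every element containing "G4" matches.
all_hits <- grep("G45*", x)

# "^" anchors the match to the start of the string; no wildcard is needed.
prefix_hits <- grep("^G45", x)

# glob2rx() translates a shell-style wildcard into an anchored regex.
rx <- glob2rx("G45*")
glob_hits <- grep(rx, x)

print(all_hits)
print(prefix_hits)
print(rx)
```

Here all_hits covers every element, while prefix_hits and glob_hits pick out only the second one.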

Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).

My.Data[grep("^G45", My.Data$x), ]
#      x y
# 2 G459 2

The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.

subset(My.Data, grepl("^G45", My.Data$x))
#      x y
# 2 G459 2
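grepl can also be used directly with [. Because it returns one TRUE/FALSE per row rather than positions, it is easy to negate. A minimal sketch, assuming the same My.Data as in the question (keep, matched and unmatched are names chosen for the example):

```r
My.Data <- data.frame(x = c("G448", "G459", "G479", "G406"), y = 1:4)

# One logical value per row: TRUE where x starts with "G45".
keep <- grepl("^G45", My.Data$x)

matched   <- My.Data[keep, ]    # rows that start with "G45"
unmatched <- My.Data[!keep, ]   # all other rows

print(matched)
print(unmatched)
```

Negating the logical vector is generally safer than My.Data[-grep(...), ]: when nothing matches, grep returns an empty integer vector, and negative indexing with it selects zero rows instead of all of them.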

As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.

subset(My.Data, startsWith(as.character(x), "G45"))
#      x y
# 2 G459 2

A5C1D2H2I1M1N2O1R2T1 answered Oct 15 '22 16:10
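To see that the approaches select the same rows, here is a small comparison on a plain character vector. It is only an illustrative sketch (the by_* names are invented here) and does not measure speed. One practical difference: startsWith and substr compare the prefix literally, so characters such as "." or "(" in the prefix need no escaping, whereas in grepl they would be treated as regex syntax.

```r
x <- c("G448", "G459", "G479", "G406")

by_startsWith <- startsWith(x, "G45")       # literal prefix test (R >= 3.3)
by_substr     <- substr(x, 1, 3) == "G45"   # compare the first three characters
by_grepl      <- grepl("^G45", x)           # anchored regular expression

print(by_startsWith)
print(identical(by_startsWith, by_substr))
print(identical(by_startsWith, by_grepl))
```

The as.character() call in the subset example is there because startsWith only accepts character vectors; if x is stored as a factor (the data.frame default before R 4.0), it has to be converted first.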