I'm trying to select rows in a dataframe where the string contained in a column matches either a regular expression or a substring:
dataframe:
aName        bName       pName           call  alleles  logRatio   strength
AX-11086564  F08_ADN103  2011-02-10_R10  AB    CG        0.363371  10.184215
AX-11086564  A01_CD1919  2011-02-24_R11  BB    GG       -1.352707   9.54909
AX-11086564  B05_CD2920  2011-01-27_R6   AB    CG       -0.183802   9.766334
AX-11086564  D04_CD5950  2011-02-09_R9   AB    CG        0.162586  10.165051
AX-11086564  D07_CD6025  2011-02-10_R10  AB    CG       -0.397097   9.940238
AX-11086564  B05_CD3630  2011-02-02_R7   AA    CC        2.349906   9.153076
AX-11086564  D04_ADN103  2011-02-10_R2   BB    GG       -1.898088   9.872966
AX-11086564  A01_CD2588  2011-01-27_R5   BB    GG       -1.208094   9.239801
For example, I want a dataframe containing only rows that contain ADN in column bName. Secondarily, I would like all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName.
I tried using functions grep(), agrep() and more but without success...
Using bracket notation on an R data frame, you can select rows by column value, by index, by name, or by condition. The base R function subset() gives the same results. Besides these, the dplyr package provides another function, dplyr::filter(), to get rows from a data frame.
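As a minimal sketch of the first two approaches, the following uses a small hypothetical data frame named dat that borrows only the bName and pName columns from the question. It stays in base R, so dplyr::filter() is not shown.

```r
# Hypothetical three-row sample modeled on the question's columns
dat <- data.frame(
  bName = c("F08_ADN103", "A01_CD1919", "D04_ADN103"),
  pName = c("2011-02-10_R10", "2011-02-24_R11", "2011-02-10_R2"),
  stringsAsFactors = FALSE
)

# Bracket notation: a logical vector picks the rows,
# the empty column index keeps all columns
by_bracket <- dat[grepl("ADN", dat$bName), ]

# subset(): column names are looked up inside the data frame
by_subset <- subset(dat, grepl("ADN", bName))

print(by_bracket)
print(by_subset)
```

Both calls keep the two rows whose bName contains "ADN".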
To match a character that has special meaning in regex, prefix it with a backslash ( \ ) as an escape. E.g., regex \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . To match a backslash itself, use regex \\ . Note that inside an R string literal each backslash must itself be doubled, so regex \. is written "\\." in R code.
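A short sketch of how these escapes look in R code. The vector x is made up for illustration; the point is that each regex backslash is typed twice inside an R string.

```r
# Hypothetical test strings; "C:\\temp" is the text C:\temp
x <- c("a.b", "axb", "1+1", "f(x)", "C:\\temp")

dot_any     <- grepl(".", x)       # unescaped dot matches any character
dot_literal <- grepl("\\.", x)     # regex \.  matches a literal "."
plus        <- grepl("\\+", x)     # regex \+  matches a literal "+"
paren       <- grepl("\\(", x)     # regex \(  matches a literal "("
backslash   <- grepl("\\\\", x)    # regex \\  matches a literal "\"

print(data.frame(x, dot_any, dot_literal, plus, paren, backslash))
```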
Details. A 'regular expression' is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE . There is also fixed = TRUE which can be considered to use a literal regular expression.
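To illustrate the three modes, here is a rough sketch using grepl() on a made-up vector. The patterns are only examples and are not taken from the question.

```r
# Hypothetical values; the last two differ only in "." versus "x"
x <- c("F08_ADN103", "A01_CD1919", "ADN.v2", "ADNxv2")

# Extended regular expression (the default): "ADN" then digits at the end
ere <- grepl("ADN[0-9]+$", x)

# Perl-like regular expression: the lookahead (?=...) needs perl = TRUE
pcre <- grepl("ADN(?=\\d)", x, perl = TRUE)

# fixed = TRUE: the pattern is a literal string, so "." is just a dot
fixed <- grepl("ADN.", x, fixed = TRUE)

print(data.frame(x, ere, pcre, fixed))
```

With fixed = TRUE only "ADN.v2" matches, whereas the same pattern as a regex would also match "ADNxv2" and "F08_ADN103".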
A regular expression is a pattern of text that consists of ordinary characters, for example, letters a through z, and special characters.
subset(dat, grepl("ADN", bName) & pName == "2011-02-10_R2" )
Note the use of "&" (not "&&", which is not vectorized) and of "==" (not "=", which is assignment).
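For a quick check of how the combined condition behaves, here is a sketch on four rows copied from the question's data (only the bName and pName columns). The name dat follows the answer's code; keep and result are names introduced here.

```r
# Four rows from the question, reduced to the two columns being tested
dat <- data.frame(
  bName = c("F08_ADN103", "A01_CD1919", "D04_ADN103", "A01_CD2588"),
  pName = c("2011-02-10_R10", "2011-02-24_R11", "2011-02-10_R2", "2011-01-27_R5"),
  stringsAsFactors = FALSE
)

# "&" compares element by element and returns one logical value per row
keep <- grepl("ADN", dat$bName) & dat$pName == "2011-02-10_R2"
print(keep)

# The same condition inside subset()
result <- subset(dat, grepl("ADN", bName) & pName == "2011-02-10_R2")
print(result)
```

Only the D04_ADN103 row is kept, because "==" is an exact comparison and 2011-02-10_R10 does not equal 2011-02-10_R2.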
Note that you could have used:
dat[ with(dat, grepl("ADN", bName) & pName == "2011-02-10_R2" ) , ]
... and that might be preferable when used inside functions. However, it will return a row of NA values for any line where dat$pName is NA and bName matches, because the logical index is NA there. That defect (which some regard as a feature) can be removed by adding & !is.na(dat$pName) to the logical expression.
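The NA behavior can be seen in a small sketch. The data frame below is hypothetical: it reuses bName values from the question but sets two pName entries to NA on purpose.

```r
# Hypothetical data with missing pName values
dat <- data.frame(
  bName = c("F08_ADN103", "D04_ADN103", "A01_CD1919"),
  pName = c(NA, "2011-02-10_R2", NA),
  stringsAsFactors = FALSE
)

# TRUE & NA is NA, while FALSE & NA is FALSE
cond <- with(dat, grepl("ADN", bName) & pName == "2011-02-10_R2")
print(cond)

# Bracket indexing turns the NA index into a row filled with NA
with_na <- dat[cond, ]

# Adding !is.na() turns that NA into FALSE, so the row is dropped
no_na <- dat[cond & !is.na(dat$pName), ]

# subset() treats NA in the condition as FALSE on its own
via_subset <- subset(dat, grepl("ADN", bName) & pName == "2011-02-10_R2")

print(with_na)
print(no_na)
print(via_subset)
```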