I'm searching for the right regular expression. The following
t1 = c("IGF2, IGF2AS, INS, TH", "TH", "THZH", "ZGTH")
grep("TH", t1, value=TRUE)
returns all elements of t1, but only the first and second are correct. How can I return only the entries that contain TH as a whole word?
To extract words from a string, we can use the word() function from the stringr package. For example, if x is a string that contains 100 space-separated words, the first 20 words can be extracted with the command word(x, start=1, end=20, sep=fixed(" ")).
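A small sketch of word() in use, assuming stringr is installed. The sentence and the gene strings below are made up for illustration:

```r
library(stringr)  # assumed to be installed

# Hypothetical example sentence; any space-separated string works.
x <- "the quick brown fox jumps over the lazy dog"

# Words 1 through 4, using a literal space as the separator.
first_four <- word(x, start = 1, end = 4, sep = fixed(" "))

# word() is vectorised: it returns one result per input string.
# A negative start counts from the end, so -1 is the last word.
genes <- c("IGF2 IGF2AS INS TH", "TH THZH ZGTH")
last_words <- word(genes, start = -1)
```

Note that word() splits on a separator; it does not test whether a string contains a given whole word, which is what the question asks for.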
To run a “whole words only” search using a regular expression, place the word between two word boundaries, as in \bcat\b. The first \b requires the c to occur at the very start of the string or after a nonword character. The second \b requires the t to occur at the very end of the string or before a nonword character.
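The metacharacter \b is an anchor, like the caret and the dollar sign. It matches at a position called a “word boundary”, and the match is zero-length. Three different positions qualify as word boundaries: before the first character in the string, if the first character is a word character; after the last character in the string, if the last character is a word character; and between two characters in the string, where one is a word character and the other is not.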
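The boundary positions can be checked in R with grepl(). This is only a sketch, and the test strings are made up for illustration:

```r
# Boundary before the first character of the string.
at_start <- grepl("\\bTH", "TH is first")

# Boundary after the last character of the string.
at_end <- grepl("TH\\b", "ends with TH")

# Boundaries between a word character and a non-word character
# (here the space before TH and the comma after it).
in_middle <- grepl("\\bTH\\b", "INS, TH, IGF2")

# No boundary between two word characters: H and Z are both
# word characters, so TH inside THZH is not a whole word.
inside_word <- grepl("\\bTH\\b", "THZH")
```

The first three calls return TRUE and the last returns FALSE.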
You need to add word boundary anchors (\b) around your search string so that only entire words are matched, i.e. words surrounded by non-word characters or the start/end of the string, where "word character" means \w, i.e. a letter, a digit, or an underscore.
Try
grep("\\bTH\\b", t1, value=TRUE)
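If you need this for more than one search term, the anchors can be added by a small helper. The function name whole_word below is hypothetical, not part of base R, and it assumes the word contains no regex metacharacters:

```r
t1 <- c("IGF2, IGF2AS, INS, TH", "TH", "THZH", "ZGTH")

# Hypothetical helper: wrap a literal word in \b anchors.
# Assumes `word` contains no regex metacharacters.
whole_word <- function(word) paste0("\\b", word, "\\b")

# Keeps only the elements that contain TH as a whole word.
matches <- grep(whole_word("TH"), t1, value = TRUE)

# The underscore counts as a word character, so TH_1 does not
# contain TH as a whole word.
underscore <- grepl(whole_word("TH"), "TH_1")
```

Here matches holds the first two elements of t1, and underscore is FALSE.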
You can use \< and \> in a regexp to match at the beginning and end of a word:
grep("\\<TH\\>", t1)
This returns the indices of the matching elements; add value=TRUE to return the elements themselves.
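As a quick check, here is a sketch that runs both anchor styles on the vector from the question, using R's default regex engine:

```r
t1 <- c("IGF2, IGF2AS, INS, TH", "TH", "THZH", "ZGTH")

# Without value = TRUE, grep() returns the matching positions.
idx <- grep("\\<TH\\>", t1)

# With value = TRUE, it returns the matching elements.
vals <- grep("\\<TH\\>", t1, value = TRUE)

# Both anchor styles select the same elements here.
same <- identical(vals, grep("\\bTH\\b", t1, value = TRUE))
```

idx is 1 2, and same is TRUE for this input.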