
R: Finding multiple string matches in a vector of strings

I have the following list of file names:

files.list <- c("Fasted DWeib NoCmaxW.xlsx", "Fed DWeib NoCmaxW.xlsx", "Fasted SWeib NoCmaxW.xlsx", "Fed SWeib NoCmaxW.xlsx", "Fasted DWeib Cmax10.xlsx", "Fed DWeib Cmax10.xlsx", "Fasted SWeib Cmax10.xlsx", "Fed SWeib Cmax10.xlsx")

I want to identify which files have the following sub-strings:

toMatch <- c("Fasted", "DWeib NoCmaxW")

The examples I have found often quote the following usage:

grep(paste(toMatch, collapse = "|"), files.list, value=TRUE)

However, this returns five matches:

[1] "Fasted DWeib NoCmaxW.xlsx" "Fed DWeib NoCmaxW.xlsx"    "Fasted SWeib NoCmaxW.xlsx"
[4] "Fasted DWeib Cmax10.xlsx"  "Fasted SWeib Cmax10.xlsx" 

I want only the file name that contains both elements of toMatch (i.e. "Fasted" and "DWeib NoCmaxW"). Only one file satisfies that requirement (files.list[1]). I assumed the "|" passed to paste() acts as a logical OR, so I tried "&" instead, but that did not solve the problem.

Can someone please help?

Thank you.

please help asked May 20 '18 03:05



1 Answer

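We can use &, but on the logical vectors returned by grepl() rather than inside the pattern: test each element of 'toMatch' separately and combine the results.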

i1 <- grepl(toMatch[1], files.list) & grepl(toMatch[2], files.list)

If 'toMatch' has more than two elements, loop over them with lapply() and Reduce() the resulting logical vectors to a single one with &

i1 <- Reduce(`&`, lapply(toMatch, grepl, x = files.list))
files.list[i1]
#[1] "Fasted DWeib NoCmaxW.xlsx"
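
The Reduce() approach above can be wrapped in a small function. This is only a sketch: match_all() is a hypothetical helper name, not a base R function, and fixed = TRUE is an added assumption that the elements of 'toMatch' are literal substrings rather than regular expressions.

```r
files.list <- c("Fasted DWeib NoCmaxW.xlsx", "Fed DWeib NoCmaxW.xlsx",
                "Fasted SWeib NoCmaxW.xlsx", "Fed SWeib NoCmaxW.xlsx",
                "Fasted DWeib Cmax10.xlsx", "Fed DWeib Cmax10.xlsx",
                "Fasted SWeib Cmax10.xlsx", "Fed SWeib Cmax10.xlsx")
toMatch <- c("Fasted", "DWeib NoCmaxW")

# match_all() is a hypothetical helper, not part of base R.
# fixed = TRUE treats every element of `patterns` as a literal substring,
# so characters such as "." or "(" are not interpreted as regex syntax.
match_all <- function(patterns, x) {
  # One logical vector per pattern, each the same length as x
  hits <- lapply(patterns, grepl, x = x, fixed = TRUE)
  # Keep only the elements of x that matched every pattern
  x[Reduce(`&`, hits)]
}

match_all(toMatch, files.list)
#[1] "Fasted DWeib NoCmaxW.xlsx"
```

Because each substring is tested independently, the result does not depend on the order in which the substrings appear in the file name.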

It is also possible to collapse the elements with \\b.*\\b, i.e. match the first element of 'toMatch', followed by a word boundary (\\b), any number of characters (.*), and another word boundary (\\b) before the second element of 'toMatch'. This works for this example. It may be better to add a word boundary at the start and end of the pattern as well (not needed for this example)

pat1 <- paste(toMatch, collapse= "\\b.*\\b")
grep(pat1, files.list, value = TRUE)
#[1] "Fasted DWeib NoCmaxW.xlsx"
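
To illustrate when the extra word boundaries at the start and end would matter, here is a small sketch. The file name "NonFasted DWeib NoCmaxW.xlsx" is hypothetical and not part of the original list; it is included only to show a partial-word match.

```r
toMatch <- c("Fasted", "DWeib NoCmaxW")

# The second file name is hypothetical, added to show a partial-word match.
x <- c("Fasted DWeib NoCmaxW.xlsx", "NonFasted DWeib NoCmaxW.xlsx")

# Pattern as above: word boundaries only between the elements
loose <- paste(toMatch, collapse = "\\b.*\\b")

# Same pattern with a word boundary added at the start and at the end
strict <- paste0("\\b", loose, "\\b")

grepl(loose, x)
#[1] TRUE TRUE
grepl(strict, x)
#[1]  TRUE FALSE
```

Without the leading \\b, "Fasted" is also found inside "NonFasted"; with it, the match has to begin at the start of a word.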

However, this only matches the elements in the order in which they appear in 'toMatch'. If the substrings can also occur in the reverse order and those should match too, build a second pattern from the reversed vector and join the two with |

pat2 <- paste(rev(toMatch), collapse="\\b.*\\b")
pat <- paste(pat1, pat2, sep="|")
grep(pat, files.list, value = TRUE) 
#[1] "Fasted DWeib NoCmaxW.xlsx"
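
The original file list contains no name with the substrings in reverse order, so the difference between the two patterns is not visible there. A minimal sketch with one hypothetical file name ("DWeib NoCmaxW Fasted.xlsx", not in the original list) shows it:

```r
toMatch <- c("Fasted", "DWeib NoCmaxW")

# The second file name is hypothetical: same substrings, reversed order.
x <- c("Fasted DWeib NoCmaxW.xlsx",
       "DWeib NoCmaxW Fasted.xlsx",
       "Fed DWeib NoCmaxW.xlsx")

pat1 <- paste(toMatch, collapse = "\\b.*\\b")       # original order only
pat2 <- paste(rev(toMatch), collapse = "\\b.*\\b")  # reversed order only
pat  <- paste(pat1, pat2, sep = "|")                # either order

grep(pat1, x, value = TRUE)
#[1] "Fasted DWeib NoCmaxW.xlsx"
grep(pat, x, value = TRUE)
#[1] "Fasted DWeib NoCmaxW.xlsx" "DWeib NoCmaxW Fasted.xlsx"
```

With more than two elements in 'toMatch' the number of orderings grows quickly, so combining separate grepl() results with & is likely the simpler choice in that case.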
akrun answered Oct 11 '22 15:10