
R grep and exact matches

Tags: regex, r

It seems grep is "greedy" in the way it returns matches. Suppose I have the following data:

Sources <- c(
                "Coal burning plant",
                "General plant",
                "coalescent plantation",
                "Charcoal burning plant"
        )

Registry <- seq(from = 1100, to = 1103, by = 1)

df <- data.frame(Registry, Sources)

If I perform grep("(?=.*[Pp]lant)(?=.*[Cc]oal)", df$Sources, perl = TRUE, value = TRUE), it returns

"Coal burning plant"     
"coalescent plantation"  
"Charcoal burning plant" 

However, I only want exact matches, i.e. only rows where "coal" and "plant" occur as whole words. I don't want "coalescent", "plantation" and so on. So for this data, I only want to see "Coal burning plant".

sedeh asked Jun 16 '14 01:06


2 Answers

If you always want the order "coal" then "plant", then this should work:

grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", Sources, perl = TRUE, value = TRUE)

Here we add \b, which matches a word boundary. You can add the word boundaries to your original attempt as well:

grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", Sources, 
    perl = TRUE, value = TRUE)
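
To make the difference between the two patterns concrete, here is a minimal sketch. It assumes one extra row, "Plant fired by coal", which is not in the question's data and is added only to show the effect of word order.

```r
Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant",
  "Plant fired by coal"  # assumed extra row, not in the original data
)

# Fixed order: the whole word "coal" must come before the whole word "plant".
ordered <- grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", Sources,
                perl = TRUE, value = TRUE)

# Lookaheads: both whole words must occur, in either order.
any_order <- grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", Sources,
                  perl = TRUE, value = TRUE)

print(ordered)
print(any_order)
```

The ordered pattern should keep only "Coal burning plant", while the lookahead version should also keep the reversed row. Neither matches "coalescent plantation" or "Charcoal burning plant", because there is no word boundary inside those words.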
MrFlick answered Sep 18 '22 12:09


You want to use word boundaries \b around your word patterns. A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not. You may also want to consider using the inline (?i) modifier for case-insensitive matching.

grep('(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)', df$Sources, perl = TRUE, value = TRUE)
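
If you need this for more than two words, the same idea can be wrapped in a small function. The sketch below is only an illustration: `match_all_words` is a hypothetical helper name, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: TRUE for each string that contains every word in
# `words` as a whole word, ignoring case and word order.
# Assumes `words` contains no regex metacharacters.
match_all_words <- function(x, words) {
  # Build one lookahead per word, e.g. (?=.*\bcoal\b)
  lookaheads <- paste0("(?=.*\\b", words, "\\b)", collapse = "")
  pattern <- paste0("(?i)", lookaheads)
  grepl(pattern, x, perl = TRUE)
}

Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant"
)
Registry <- seq(from = 1100, to = 1103, by = 1)
df <- data.frame(Registry, Sources)

# Subset the data frame rather than returning only the matching strings.
result <- df[match_all_words(df$Sources, c("coal", "plant")), ]
print(result)
```

Using `grepl` instead of `grep(..., value = TRUE)` gives a logical vector, so the matching rows keep their `Registry` values.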

hwnd answered Sep 21 '22 12:09