
Grep in R to find words with custom "extended" boundaries

Tags: regex, r
I'm looking for a regular expression to grep whole words, including words that are delimited by digits or underscores. \\b treats digits and underscores as parts of words, not as boundaries.

For example, I'd like to catch MOUSE in "DOG MOUSE CAT", in "DOG MOUSE:CAT" but also in "DOG_MOUSE9CAT" and at the end or the beginning of an expression, as in "MOUSE9CAT" and "DOG_MOUSE". Basically, the boundary I'm looking for is any non-uppercase-alpha character plus beginning and end of line/expression (maybe missing some other cases caught by \\b here).
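To make the limitation concrete, here is a minimal sketch that runs plain \\b against the five example strings above. The names `examples` and `hits_b` are chosen here for illustration only.

```r
# The five example strings from the question
examples <- c("DOG MOUSE CAT", "DOG MOUSE:CAT", "DOG_MOUSE9CAT",
              "MOUSE9CAT", "DOG_MOUSE")

# \\b treats digits and "_" as word characters, so it finds no boundary
# between "_" and "M" or between "E" and "9"
hits_b <- grepl("\\bMOUSE\\b", examples)
print(hits_b)
```

Only the first two strings match; the three strings where MOUSE touches a digit or an underscore are missed.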

I've tried:

"[[0-9_]\\b]MOUSE[[0-9_]\\b]"
"[[0-9_]|\\b]MOUSE[[0-9_]|\\b]"
"[$|[^A-Z]]MOUSE[^|[^A-Z]]"
"[?<=^|[^A-Z]]MOUSE[?=$|[^A-Z]]"

None of them work.

I'm actually looking for several words (based on a long vector of values), so the final result should look something like

grep(paste0("\\b", paste(searchwords, collapse = "\\b|\\b"), "\\b"), targettext)

(with a different delimiter because \\b is too restrictive for me).

(This is a similar question to the one asked by user Nick Sabbe in a comment here: Using grep in R to find strings as whole words (but not strings as part of words))

asked Nov 25 '16 10:11 by syre



1 Answer

Use PCRE regex with lookarounds:

grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)

See the regex demo

The negative lookbehind (?<![A-Z]) fails the match if the word is preceded by an uppercase ASCII letter, and the negative lookahead (?![A-Z]) fails the match if the word is followed by an uppercase ASCII letter.
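As a quick check of that behaviour, the sketch below runs the pattern on strings that should match and on strings that should not. The vectors `accept` and `reject` are made-up test data, not part of the original question.

```r
# Strings where MOUSE is delimited by non-uppercase characters or string edges
accept <- c("DOG_MOUSE9CAT", "MOUSE9CAT", "DOG_MOUSE", "MOUSE")
# Strings where MOUSE is glued to another uppercase letter
reject <- c("DOGMOUSE", "MOUSETRAP", "XMOUSEX")

pattern <- "(?<![A-Z])MOUSE(?![A-Z])"

# Lookarounds are only available in the PCRE engine, hence perl = TRUE
accept_hits <- grepl(pattern, accept, perl = TRUE)
reject_hits <- grepl(pattern, reject, perl = TRUE)
print(accept_hits)
print(reject_hits)
```

Because a negative lookaround only asserts that an uppercase letter is absent, it also succeeds at the very start or end of the string, so those positions need no separate alternative such as ^ or $.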

To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).

See the R online demo:

> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
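
If the matching strings or the matched words themselves are needed rather than the indices, something along these lines should work. It is a sketch on hypothetical data; the names `texts`, `words`, `matched_strings` and `found_words` are illustrative, and it assumes the search words contain no regex metacharacters.

```r
# Hypothetical data: two strings that should match, two that should not
texts <- c("DOG MOUSE CAT", "DOG_FROG9CAT", "BULLFROG", "MOUSETRAP")
words <- c("MOUSE", "FROG")

# Same construction as in the answer: one pair of lookarounds around
# a non-capturing group holding all alternatives
pattern <- paste0("(?<![A-Z])(?:",
                  paste(words, collapse = "|"),
                  ")(?![A-Z])")

# value = TRUE returns the matching strings instead of their indices
matched_strings <- grep(pattern, texts, perl = TRUE, value = TRUE)

# gregexpr() locates every match and regmatches() extracts the matched text,
# giving one character vector per input string
found_words <- regmatches(texts, gregexpr(pattern, texts, perl = TRUE))

print(matched_strings)
print(found_words)
```

Strings without a match yield an empty character vector in `found_words`, so the result stays aligned with the input.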

answered Sep 27 '22 16:09 by Wiktor Stribiżew