
Whole-word matching on a body of text, given a list of words

Note:

Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:

  • How to grep with a list of words
  • How to make grep only match if the entire line matches?
  • how to grep for the whole word
  • Grep extract only whole word

Background:

I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).

In other words, a match should be made when:

  1. The word is all by itself on a line
  2. The word is surrounded by characters that are neither alphanumeric nor a hyphen
  3. The word is at the beginning of a line and followed by a character that is neither alphanumeric nor a hyphen
  4. The word is at the end of a line and preceded by a character that is neither alphanumeric nor a hyphen

For example, if one of the words in words.txt is cat, I would like it to behave as follows:

cat              #=> match
cat cat cat      #=> match
the cat is gray  #=> match
mouse,cat,dog    #=> match
caterpillar cat  #=> match
caterpillar      #=> no match
concatenate      #=> no match
bobcat           #=> no match
catcat           #=> no match
cat100           #=> no match
cat-in-law       #=> no match

Previous research:

There's a grep command that almost suits my needs. It is as follows:

grep -wf words.txt file.txt

where the options are:

-w, --word-regexp
       Select only those lines containing matches that form whole words.
       The test is that the matching substring must either be at the beginning
       of the line, or preceded by a non-word constituent character.
       Similarly, it must be either at the end of the line or followed by a
       non-word constituent character. Word-constituent characters are
       letters, digits, and the underscore.
-f FILE, --file=FILE
       Obtain patterns from FILE, one per line. The empty file contains
       zero patterns, and therefore matches nothing.

The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.

I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.
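To make the problem concrete, here is a minimal sketch of the -w behavior on four of the sample lines. The variable name w_out is only for illustration.

```shell
# grep -w treats '-' as a non-word character, so "cat-in-law" is reported
# as a whole-word match for "cat", while "bobcat" and "cat100" are not.
w_out=$(printf '%s\n' 'cat' 'cat-in-law' 'bobcat' 'cat100' | grep -w 'cat')
printf '%s\n' "$w_out"
# prints:
# cat
# cat-in-law
```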

Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:

grep -Ef words.txt file.txt

where

-E, --extended-regexp
              Interpret PATTERN as an extended regular expression

However, I would like to avoid altering words.txt and keep it free of regex patterns.

Question:

Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?

seane asked May 26 '15 22:05



1 Answer

I finally came up with a solution:

grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt

Explanation:

  • words.txt is my list of words (one per line).
  • file.txt is the body of text that I would like to search.
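  • The awk command preprocesses words.txt on the fly, wrapping each word in a regular expression that marks where the word must begin and end: it has to be preceded by the start of the line or by a character that is neither alphanumeric nor a hyphen, and followed by the end of the line or by such a character (per the specification in my question above).
  • The awk command is surrounded by <( and ), which is bash process substitution, so grep reads awk's output as if it were the file given to the -f option.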
  • I'm using the -E option because I'm now inputting a list of regular expressions instead of fixed strings from words.txt.
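As a rough check of the points above, the following sketch rebuilds the sample from the question in a scratch directory and runs the same awk wrapper against it. The directory variable demo and the intermediate file patterns.txt are assumptions made for this illustration. The patterns are written to a file rather than passed through <( ) only so the snippet also runs in shells without process substitution.

```shell
# Scratch directory so real files named words.txt / file.txt are not touched.
demo=$(mktemp -d)
printf '%s\n' 'cat' > "$demo/words.txt"
printf '%s\n' \
  'cat' 'cat cat cat' 'the cat is gray' 'mouse,cat,dog' 'caterpillar cat' \
  'caterpillar' 'concatenate' 'bobcat' 'catcat' 'cat100' 'cat-in-law' \
  > "$demo/file.txt"

# Same awk wrapper as in the answer, written to a file instead of <( ... ).
awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' "$demo/words.txt" \
  > "$demo/patterns.txt"
cat "$demo/patterns.txt"

# Lines of file.txt that contain a whole-word match for any listed word.
matches=$(grep -Ef "$demo/patterns.txt" "$demo/file.txt")
printf '%s\n' "$matches"
```

With this sample, the generated pattern should be ([^a-zA-Z0-9-]|^)cat([^a-zA-Z0-9-]|$), and the matched lines should be the five lines marked "match" in the question.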

The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
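One caveat: because the wrapped words are handed to grep -E, any regex metacharacter inside a word (for example the + in c++ or the . in a.b) is interpreted as regex syntax, and a blank line in words.txt produces a pattern with no word in it. The sketch below is one possible way to guard against both, assuming a sed step that backslash-escapes the metacharacters and an NF test in awk that skips blank lines. The sample words, the demo directory, and patterns.txt are made up for the illustration.

```shell
# Scratch directory with a word list that contains regex metacharacters
# and a blank line (all sample data here is hypothetical).
demo=$(mktemp -d)
printf '%s\n' 'c++' '' 'a.b' > "$demo/words.txt"
printf '%s\n' 'c++' 'I like c++ a lot' 'ccc' 'c++11' 'a.b' 'axb' 'plain line' \
  > "$demo/file.txt"

# 1. sed puts a backslash in front of each ERE metacharacter in a word.
# 2. awk skips blank lines (NF is 0) and applies the same wrapper as the answer.
sed 's/[][\.|$(){}?+*^]/\\&/g' "$demo/words.txt" |
  awk 'NF {print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' > "$demo/patterns.txt"

escaped_matches=$(grep -Ef "$demo/patterns.txt" "$demo/file.txt")
printf '%s\n' "$escaped_matches"
# prints:
# c++
# I like c++ a lot
# a.b
```

words.txt itself stays untouched and human-readable; the escaping only happens in the stream that feeds grep.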

seane answered Sep 21 '22 18:09
