Note:
Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:
Background:
I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).
In other words, a match should be made when:
For example, if one of the words in words.txt is cat, I would like it to behave as follows:
cat #=> match
cat cat cat #=> match
the cat is gray #=> match
mouse,cat,dog #=> match
caterpillar cat #=> match
caterpillar #=> no match
concatenate #=> no match
bobcat #=> no match
catcat #=> no match
cat100 #=> no match
cat-in-law #=> no match
Previous research:
There's a grep command that almost suits my needs. It is as follows:
grep -wf words.txt file.txt
where the options are:
-w, --word-regexp
Select only those lines containing matches that form whole words.
The test is that the matching substring must either be at the beginning
of the line, or preceded by a non-word constituent character.
Similarly, it must be either at the end of the line or followed by a
non-word constituent character. Word-constituent characters are
letters, digits, and the underscore.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing.
The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.
I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.
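The hyphen behavior described here can be reproduced with a small sketch. The scratch directory and the variable names demo_dir and w_matches are made up for illustration; the file names and sample lines come from the question.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory; file names mirror those in the question.
demo_dir=$(mktemp -d)
printf '%s\n' 'cat' > "$demo_dir/words.txt"
printf '%s\n' 'the cat is gray' 'cat-in-law' 'bobcat' > "$demo_dir/file.txt"

# -w counts the hyphen as a non-word character, so cat-in-law is selected
# alongside the genuine whole-word match; bobcat is correctly skipped.
w_matches=$(grep -wf "$demo_dir/words.txt" "$demo_dir/file.txt")
printf '%s\n' "$w_matches"

rm -r "$demo_dir"
```

With this input, grep -w selects both "the cat is gray" and "cat-in-law", which is the unwanted result.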
Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:
grep -Ef words.txt file.txt
where
-E, --extended-regexp
Interpret PATTERN as an extended regular expression
However, I would like to avoid altering words.txt and keep it free of regex patterns.
Question:
Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?
I finally came up with a solution:
grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt
Explanation:
words.txt is my list of words (one per line).
file.txt is the body of text that I would like to search.
The awk command will preprocess words.txt on-the-fly, wrapping each word in a special regular expression to define its official beginning and ending (based on the specifications posted in my question above).
The awk command is surrounded by <( and ) so that its output is used as the input for the -f option.
I use the -E option because I'm now inputting a list of regular expressions instead of fixed strings from words.txt.
The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
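As a quick check, the command can be run against the example lines from the question. This is only a sketch: the scratch directory and the variable names demo_dir and matches are hypothetical, and it assumes bash (for the <( ) process substitution) and words that contain no regex metacharacters, since any such characters in words.txt would be interpreted as part of the pattern.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory; file names mirror those in the question.
demo_dir=$(mktemp -d)
printf '%s\n' 'cat' > "$demo_dir/words.txt"
printf '%s\n' 'cat' 'cat cat cat' 'the cat is gray' 'mouse,cat,dog' \
  'caterpillar cat' 'caterpillar' 'concatenate' 'bobcat' 'catcat' \
  'cat100' 'cat-in-law' > "$demo_dir/file.txt"

# Wrap each word so it must be bounded by line start/end or by a character
# that is neither alphanumeric nor a hyphen, then feed the patterns to grep.
matches=$(grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' \
  "$demo_dir/words.txt") "$demo_dir/file.txt")
printf '%s\n' "$matches"

rm -r "$demo_dir"
```

Only the first five sample lines should be printed; caterpillar, concatenate, bobcat, catcat, cat100 and cat-in-law are left out.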