
Check if all of multiple strings or regexes exist in a file

Tags: grep, bash, search

I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

In the above example, we could have regexes in place of strings.

For example, the following code checks if any of my strings exists in the file:

if grep -Eq "string1|string2|string3" file; then
    : # there is at least one match
fi

How can I check if all of them exist? Since we are only interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.

Is it possible to do it without invoking grep multiple times (which won't scale when the input file is large or when there are many strings to match) and without using a tool like awk or python?

Also, is there a solution for strings that can easily be extended for regexes?
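
For reference, the multiple-grep approach mentioned above could be sketched roughly as follows. This is only the baseline the question wants to avoid, since it reads the file once per string; all_match_grep is a made-up helper name, not a grep feature.

```shell
# Baseline sketch: one grep invocation per string.
# all_match_grep is a hypothetical helper name.
# $1: file to search; remaining arguments: fixed strings to look for
all_match_grep() {
    am_file=$1
    shift
    for am_pattern in "$@"; do
        # -F: fixed string, -q: quiet (grep stops at the first match)
        grep -Fq -- "$am_pattern" "$am_file" || return 1
    done
    return 0
}

# Demo on a throwaway file.
demo_file=$(mktemp)
printf '%s\n' 'foo string1 bar' 'string3 string2' > "$demo_file"

if all_match_grep "$demo_file" string1 string2 string3; then
    echo "all strings found"
fi
```

Replacing -F with -E would turn the same loop into a regex check.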

asked Apr 10 '18 at 20:04 by codeforester



2 Answers

Awk is the tool that the guys who invented grep, shell, etc. invented to do general text manipulation jobs like this so not sure why you'd want to try to avoid it.

In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file 

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

The above will stop reading the file as soon as all strings have matched.
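
Since the script reports its result only through the exit status, it drops straight into a shell if. Here's a minimal sketch of that, assuming the strings contain no whitespace (they are split on it); has_all_strings and the sample file contents are made up for illustration.

```shell
# has_all_strings is a hypothetical wrapper; its awk body is the script above.
# $1: file to search, $2: whitespace-separated list of strings
has_all_strings() {
    awk -v strings="$2" '
    BEGIN {
        numStrings = split(strings, tmp)
        for (i in tmp) strs[tmp[i]]
    }
    numStrings == 0 { exit }          # nothing left to find: stop reading
    {
        for (str in strs) {
            if (index($0, str)) {
                delete strs[str]      # found: never test this string again
                numStrings--
            }
        }
    }
    END { exit (numStrings ? 1 : 0) }
    ' "$1"
}

# Demo on a throwaway file.
sample=$(mktemp)
printf '%s\n' 'xx string1 yy' 'string3' 'string2 and string1' 'never needed' > "$sample"

if has_all_strings "$sample" 'string1 string2 string3'; then
    echo "all strings present"
else
    echo "at least one string missing"
fi
```

Note that awk interprets backslash escapes in -v assignments, so strings containing backslashes would need extra care.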

If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file 

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file 

The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time, whereas the first awk script above will work in any awk in any shell on any UNIX box and only stores one line of input at a time.

I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending each line to a variable as it's read and then processing that variable in the END section.
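
The "append to a variable" option might look something like the sketch below. It uses no GNU extensions, but it still holds the whole file in memory, so it shares that limitation with the RS='^$' versions. contains_all_slurp is a made-up name and the three strings are just the placeholders from the question.

```shell
# contains_all_slurp is a hypothetical name.
# $1: file to search
contains_all_slurp() {
    awk '
    { buf = buf $0 "\n" }             # rebuild the file contents line by line
    END {
        # exit status 0 only if every string occurs somewhere in the buffer
        exit !(index(buf, "string1") && index(buf, "string2") && index(buf, "string3"))
    }
    ' "$1"
}

# Demo on a throwaway file.
slurp_demo=$(mktemp)
printf '%s\n' 'string1' 'x string2' 'string3 y' > "$slurp_demo"

if contains_all_slurp "$slurp_demo"; then
    echo "all strings present"
fi
```

For regexps, the index() calls would become tests like buf ~ /regexp1/.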

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
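
Here's a rough sketch of using the streaming regexp version from a shell script. It differs from the script above in one small way, which is my own assumption about what's wanted: each distinct regexp is counted once, so a duplicated line in the pattern file can't leave the counter above zero. all_regexps_match and the sample patterns are made up for illustration.

```shell
# all_regexps_match is a hypothetical wrapper name.
# $1: file with one regexp per line, $2: file to search
all_regexps_match() {
    awk '
    NR == FNR {
        # Count each distinct regexp once so duplicate lines do not inflate the total.
        if (!($0 in regexps)) { regexps[$0]; numRegexps++ }
        next
    }
    numRegexps == 0 { exit }          # everything matched: stop reading
    {
        for (regexp in regexps) {
            if ($0 ~ regexp) {
                delete regexps[regexp]
                numRegexps--
            }
        }
    }
    END { exit (numRegexps ? 1 : 0) }
    ' "$1" "$2"
}

# Demo on throwaway files.
patterns=$(mktemp)
target=$(mktemp)
printf '%s\n' '^alpha' '[0-9]+ items' 'beta$' > "$patterns"
printf '%s\n' 'alpha one' 'has 12 items' 'ends with beta' > "$target"

if all_regexps_match "$patterns" "$target"; then
    echo "all regexps matched"
fi
```

Like the original, this relies on NR==FNR to tell the two files apart, so it assumes the pattern file is not empty.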

answered Sep 20 '22 at 18:09 by Ed Morton


git grep

Here is the syntax using git grep with multiple patterns:

git grep --all-match --no-index -l -e string1 -e string2 -e string3 file 

You may also combine patterns with Boolean expressions such as --and, --or and --not.

Check man git-grep for help.


--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.

--no-index Search files in the current directory that is not managed by Git.

-l/--files-with-matches/--name-only Show only the names of files.

-e The next parameter is the pattern. Default is to use basic regexp.

Other params to consider:

--threads Number of grep worker threads to use.

-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and others.
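
Putting these options together for a pure yes/no check might look like the sketch below, which swaps -l for -q and adds -F for fixed strings. It assumes git is installed and runs from inside the file's directory with a relative path, which is a cautious choice on my part when using --no-index. git_has_all and the sample files are made up for illustration.

```shell
# git_has_all is a hypothetical helper name; the three strings are fixed for the sketch.
# $1: directory containing the file, $2: file name relative to that directory
git_has_all() {
    # The subshell keeps the cd from affecting the caller.
    (cd "$1" && git grep --all-match --no-index -q -F \
        -e string1 -e string2 -e string3 -- "$2")
}

# Demo in a throwaway directory.
demo_dir=$(mktemp -d)
printf '%s\n' 'foo string1' 'string2 bar' 'string3' > "$demo_dir/file"

if git_has_all "$demo_dir" file; then
    echo "all strings found"
fi
```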

answered Sep 22 '22 at 18:09 by kenorb