Is it more efficient to grep twice or use a regular expression once?

Question

I'm trying to parse a couple of 2 GB+ files and want to grep on a couple of levels.

Say I want to fetch lines that contain both "foo" and "bar".

I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?

Gordon Davisson · Accepted Answer

In theory, the fastest way should be:

grep -E '(foo.*bar|bar.*foo)' file.log

There are several reasons. First, grep reads directly from the file, rather than having cat read it and push it through a pipe for grep to read. Second, it uses a single instance of grep, so each line of the file is processed only once. Third, grep -E may be faster than plain grep on large files (and slower on small ones), although this depends on your implementation of grep; in GNU grep the two modes differ only in pattern syntax, so the gap may be negligible. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that can search but aren't optimized for it.
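Since the reasoning above is "in theory", it may be worth measuring on your own data. The following is a minimal sketch, not a rigorous benchmark: the generated sample file and the helper names two_greps, one_regex, and time_cmd are made up for illustration, and the timer only has one-second resolution, which should be adequate for multi-gigabyte files but will print 0s for the tiny sample.

```shell
# Hypothetical sample log; point "log" at your real file instead.
log=$(mktemp)
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i foo"
    echo "line $i bar"
    echo "line $i foo and bar"
    echo "line $i bar and foo"
    i=$((i + 1))
done > "$log"

# The two approaches under comparison (helper names are illustrative).
two_greps() { grep foo "$1" | grep bar; }
one_regex() { grep -E '(foo.*bar|bar.*foo)' "$1"; }

# Run a command, discard its output, and report wall-clock seconds.
time_cmd() {
    label=$1; shift
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo "$label: $((end - start))s"
}

time_cmd 'two greps' two_greps "$log"
time_cmd 'one regex' one_regex "$log"

# Both approaches should select the same number of lines.
count_two=$(grep foo "$log" | grep -c bar)
count_one=$(grep -c -E '(foo.*bar|bar.*foo)' "$log")
echo "two greps matched $count_two lines, one regex matched $count_one lines"

rm -f "$log"
```

Running each variant more than once helps separate the cost of the search from the cost of reading the file from disk the first time.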

Rafe Kettler · Answer

These two operations are fundamentally different. This one:

cat file.log | grep foo | grep bar

looks for foo in file.log, then looks for bar in whatever the first grep outputs, so it returns only lines containing both. Whereas cat file.log | grep -E '(foo|bar)' looks for lines containing either foo or bar. The outputs will generally be very different. Use whichever behavior you need.

As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
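To make the difference concrete, here is a small sketch on a made-up five-line file (the file contents and variable names are assumptions for illustration). It contrasts the AND behavior of the two-grep pipeline with the OR behavior of '(foo|bar)', and also runs the single-regex AND pattern from the question.

```shell
# Build a tiny sample file (contents are made up for illustration).
sample=$(mktemp)
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > "$sample"

# AND: lines containing both words, via two greps.
and_lines=$(grep foo "$sample" | grep bar)

# OR: lines containing at least one of the words.
or_lines=$(grep -E '(foo|bar)' "$sample")

# AND again, expressed as a single regular expression.
and_regex_lines=$(grep -E '(foo.*bar|bar.*foo)' "$sample")

printf 'AND (two greps):\n%s\n\n' "$and_lines"
printf 'OR:\n%s\n\n' "$or_lines"
printf 'AND (one regex):\n%s\n' "$and_regex_lines"

rm -f "$sample"
```

On this sample the two AND variants select the same two lines, while the OR pattern selects four.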

pepoluan · Answer

grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:

Both sed and awk read the file themselves, so there is no pipe overhead.
The 'programs' I gave to sed and awk above short-circuit: lines not containing 'foo' are skipped immediately, so only lines containing 'foo' are tested against the /bar/ regex.

However, I haven't tested them. YMMV :)
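A quick way to check at least the correctness of these commands is to compare their output against the grep pipeline on a small file. The sketch below does that; the sample contents and variable names are assumptions for illustration, and it says nothing about relative speed.

```shell
# Tiny sample file (contents are made up for illustration).
sample=$(mktemp)
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > "$sample"

# Reference result: the two-grep pipeline from the question.
grep_out=$(grep foo "$sample" | grep bar)

# Candidates: the sed and awk one-liners from this answer.
sed_out=$(sed '/foo/!d;/bar/!d' "$sample")
awk_out=$(awk '/foo/ && /bar/' "$sample")

# Report whether each candidate selects the same lines as the reference.
if [ "$sed_out" = "$grep_out" ]; then echo "sed matches the grep pipeline"; fi
if [ "$awk_out" = "$grep_out" ]; then echo "awk matches the grep pipeline"; fi

rm -f "$sample"
```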

Tags:

grep

bash

unix

dtbarne

