
Is it more efficient to grep twice or use a regular expression once?

Tags:

grep

bash

unix

I'm trying to parse a couple of 2 GB+ files and want to grep on a couple of levels.

Say I want to fetch lines that contain both "foo" and "bar".

I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?

dtbarne asked May 18 '11 05:05


3 Answers

In theory, the fastest way should be:

grep -E '(foo.*bar|bar.*foo)' file.log

For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E may be faster than plain grep on large files (but slower on small files), although this depends on your implementation of grep; in GNU grep the two differ only in regex syntax, so measure rather than assume. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
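Since "in theory" only goes so far, here is a rough sketch of how one might check that the single-regex form and the two-grep pipeline select the same lines before timing them. The sample lines and the temporary file are made up for illustration; file.log in the comments stands for your real file.

```shell
# Throwaway sample file (hypothetical data standing in for file.log).
sample=$(mktemp)
printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither' > "$sample"

# One grep with alternation vs. two greps in a pipeline.
single=$(grep -E '(foo.*bar|bar.*foo)' "$sample")
piped=$(grep foo "$sample" | grep bar)

if [ "$single" = "$piped" ]; then
    echo "same output"
else
    echo "outputs differ"
fi

# To compare speed, time each form against the real file, for example:
#   time grep -E '(foo.*bar|bar.*foo)' file.log > /dev/null
#   time sh -c 'grep foo file.log | grep bar' > /dev/null
rm -f "$sample"
```

Run each timing a few times so the file is in the page cache for both; otherwise the first command pays for the disk reads and looks slower than it is.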

Gordon Davisson answered Sep 21 '22 02:09


These two operations are fundamentally different. This one:

cat file.log | grep foo | grep bar

looks for foo in file.log, then looks for bar in the output of the first grep. By contrast, cat file.log | grep -E '(foo|bar)' looks for either foo or bar in file.log, so the output can be very different. Use whichever behavior you need.

As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
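To make the difference concrete, here is a small sketch on four made-up lines (the sample text is an assumption, not from the original files):

```shell
# Four hypothetical input lines.
lines=$(printf '%s\n' 'foo only' 'bar only' 'foo and bar' 'neither')

# Alternation matches lines containing foo OR bar.
either=$(printf '%s\n' "$lines" | grep -E '(foo|bar)')

# Chained greps keep only lines containing foo AND bar.
both=$(printf '%s\n' "$lines" | grep foo | grep bar)

printf 'OR:\n%s\n\nAND:\n%s\n' "$either" "$both"
```

The OR form returns the first three lines, while the chained form returns only 'foo and bar'.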

Rafe Kettler answered Sep 21 '22 02:09


grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:

  • Both sed and awk perform their own file reading; no need for pipe overhead
  • The 'programs' I gave to sed and awk above skip lines not containing 'foo' early (sed's d ends processing of the line, and awk's && short-circuits), so only lines containing 'foo' are tested against the /bar/ regex

However, I haven't tested them. YMMV :)
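A quick way to at least sanity-check the two commands is to run them on a tiny sample and compare against the grep pipeline. This is only a sketch: the sample lines and temporary file are invented, and it checks correctness, not speed.

```shell
# Throwaway sample file (hypothetical data).
sample=$(mktemp)
printf '%s\n' 'xx foo yy bar' 'bar before foo' 'just foo' 'just bar' 'nothing' > "$sample"

via_sed=$(sed '/foo/!d;/bar/!d' "$sample")
via_awk=$(awk '/foo/ && /bar/' "$sample")
via_grep=$(grep foo "$sample" | grep bar)

# All three should print the same two lines.
if [ "$via_sed" = "$via_grep" ] && [ "$via_awk" = "$via_grep" ]; then
    echo "sed, awk and grep|grep agree"
else
    echo "results differ"
fi
rm -f "$sample"
```

For the efficiency claim itself, wrap each command in time against the real file.log and compare.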

pepoluan answered Sep 19 '22 02:09