How to grep within a grep

Tags: linux, grep, unix

I have a bunch of massive text files, about 100MB each.

I want to grep to find entries that have 'INDIANA JONES' in them:

$ grep -ir 'INDIANA JONES' ./

Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?

# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
David542 asked Nov 09 '13 00:11



3 Answers

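Use grep's -o flag to output up to 5000 characters surrounding the match, then search those characters for the second string. Because grep works line by line, the window never extends past the line that contains the match. For example:

grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL"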
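
As a hedged sketch of the same pipeline, the window can be turned into a parameter. The function name near_on_line and its two arguments are made up for illustration. The bound is written as {0,N} rather than {N} on the assumption that matches closer than N characters to the start or end of a line should still count.

```shell
# near_on_line FILE WINDOW   (hypothetical helper, not part of grep)
# Print every stretch of up to WINDOW characters on either side of
# INDIANA JONES (same line only) that also contains PORTUGAL.
near_on_line() {
    grep -ioE ".{0,$2}INDIANA JONES.{0,$2}" "$1" | grep -i "PORTUGAL"
}
```

Calling near_on_line file.txt 5000 corresponds to the command above, except that the second grep is case-insensitive as well.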

If you need the original lines, add the -n flag to the first grep, so that each match is prefixed with its line number in file.txt, and pipe the output of the second grep into:

cut -f1 -d: > line_numbers.txt

You could then use awk to print those lines:

awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt

To avoid the temporary file, this could be written as:

awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -inoE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL" | cut -f1 -d:) file.txt

For multiple files, use find and a bash loop (reading NUL-separated names so that file names containing spaces are handled):

find . -type f -print0 | while IFS= read -r -d '' i; do
    awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -inoE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep "PORTUGAL" | cut -f1 -d:) "$i"
done
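
If only the names of the matching files are needed, one alternative to the loop is to let find hand the file names to a small sh script. This is a sketch only: list_nearby_files is a hypothetical helper name, the window is passed as a parameter, and it prints file names rather than matching lines.

```shell
# list_nearby_files DIR WINDOW   (hypothetical helper)
# Print each file under DIR in which PORTUGAL appears within WINDOW
# characters of INDIANA JONES on the same line.
list_nearby_files() {
    find "$1" -type f -exec sh -c '
        window=$1; shift            # first argument is the window size
        for f do                    # remaining arguments are file names
            if grep -ioE ".{0,$window}INDIANA JONES.{0,$window}" "$f" |
               grep -qi "PORTUGAL"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$2" {} +
}
```

For example, list_nearby_files . 5000 would search the current directory tree.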
Steve answered Oct 02 '22 02:10



One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on each record, that is, on the text between two consecutive separators (after stripping newlines, assuming newlines do not count towards the limit of 5000). RT is a gawk extension that holds the text matched by RS, so this will not work with other awk implementations. You may have to resort to find to run this recursively within a directory.

gawk -v RS='INDIANA JONES|PORTUGAL' '{a = $0;
gsub("\n", "", a)};
((RT ~ /IND/ && prevRT ~/POR/) || (RT ~ /POR/ && prevRT ~/IND/)) && length(a) < 5000{found=1};
{prevRT=RT};
END{if (found) print FILENAME}' file.txt
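
A variant of the same idea, sketched under the assumption that gawk is installed: it takes the limit as a parameter and resets its state at the start of each file, so several files can be checked in one run. The names near_in_files, lim, prev, gap and hit are illustrative and not part of the command above.

```shell
# near_in_files LIMIT FILE...   (hypothetical helper; requires gawk for RT)
# Print each FILE in which the two terms are separated by fewer than
# LIMIT characters, not counting newlines.
near_in_files() {
    lim=$1; shift
    gawk -v RS='INDIANA JONES|PORTUGAL' -v limit="$lim" '
        FNR == 1 { prev = "" }      # do not pair terms across files
        {
            gap = $0                # text between two consecutive separators
            gsub(/\n/, "", gap)
            if (RT != "" && prev != "" && RT != prev && length(gap) < limit + 0)
                hit[FILENAME] = 1
            prev = RT
        }
        END { for (f in hit) print f }
    ' "$@"
}
```

The order in which several matching file names are printed is not defined, since it follows awk's array traversal order.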
iruvar answered Oct 02 '22 01:10



Consider installing ack-grep.

sudo apt-get install ack-grep

ack-grep (ack) is a grep-like search tool designed for searching source trees.

There's no trivial solution to your question (that I can think of) short of a full script, but you can use the -A and -B flags of ack-grep to specify a number of trailing or leading lines to output, respectively.

This counts lines rather than characters, but it is a step in that direction.

While this may not be a solution, it might give you some idea of how to do this. Look up filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.

The ack-grep manual:

http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html

EDIT:

I think the sad news is that what you might think you're looking for is something like:

grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename

The problem is that, even on a small file, this query takes impractically long to run. I got this one to work with a different number, so it's a size problem.

For such a large set of files, you'll need to do this in more than one step.

A Solution:

The only solution I know of is the leading and trailing output from ack-grep.

Step 1: how long are your lines?

If you knew how many lines out you had to go (and you could estimate/calculate this a few ways) then you'd be able to grep the output of the first grep. Depending on what's in your file, you should be able to get a decent upper bound as to how many lines is 5000 chars (if a line has 100 chars average, 50+ lines should cover you, but if it has 10 chars, you'll need 500+).

You've got to determine the maximum number of lines that could be 5000 chars. You could guess or pick a high range if you like, but that'll be up to you. It's your data.
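
The estimate can be computed rather than guessed. A rough sketch, assuming the average line length is representative of the file (stretches of unusually short lines would need a larger count); context_lines is a made-up helper name.

```shell
# context_lines FILE CHARS   (hypothetical helper)
# Print roughly how many lines are needed to cover CHARS characters,
# based on the average line length of FILE (newline included).
context_lines() {
    awk -v chars="$2" '
        { total += length($0) + 1 }         # +1 counts the newline
        END {
            if (NR == 0) exit 1             # empty file: no estimate
            avg = total / NR
            n = int(chars / avg)
            if (n * avg < chars) n++        # round up
            print n
        }
    ' "$1"
}
```

The result of context_lines filename 5000 could then be used as the value for -A and -B.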

With that, call (if you needed 100 lines for 5000 chars):

ack-grep -ira "PORTUGAL" -A 100 -B 100 filename

and

ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename

Replace the 100s with what you need.

Step 2: parse the output

You'll need to take the matches that ack-grep returns and parse them, looking for any matches again within these sub-ranges.

Look for INDIANA JONES in the output of the first (PORTUGAL) ack-grep call, and look for PORTUGAL in the output of the second.

This should take a bit more work, likely involving a bash script (I might see if I can get one working this week), but it solves your massive-data problem by breaking it down into more manageable chunks.
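
An alternative to parsing context lines, offered as a sketch rather than as the script mentioned above: grep's -b and -o flags together print the byte offset of every match, and awk can then compare the offsets directly, which also works across line breaks. It assumes a grep that supports -o and -b (GNU grep does) and measures the distance in bytes between the starts of the two terms, newlines included. near_offsets is a hypothetical function name.

```shell
# near_offsets FILE MAX   (hypothetical helper)
# Succeed (exit status 0) if some INDIANA JONES and some PORTUGAL
# start within MAX bytes of each other anywhere in FILE.
near_offsets() {
    grep -obiE 'INDIANA JONES|PORTUGAL' "$1" |
    awk -F: -v max="$2" '
        { term = toupper($2) }                       # input lines look like 1234:Portugal
        term == "PORTUGAL" { p = $1 + 0; have_p = 1 }
        term != "PORTUGAL" { j = $1 + 0; have_j = 1 }
        # Only the most recent offset of each term can form the closest pair.
        have_p && have_j && (p > j ? p - j : j - p) <= max + 0 { found = 1; exit }
        END { exit !found }
    '
}
```

A usage example would be: if near_offsets file.txt 5000; then echo file.txt; fi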

Plasmarob answered Oct 02 '22 01:10
