grep -F -f file1 file2
file1 is 90 Mb (2.5 million lines, one word per line)
file2 is 45 Gb
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f
option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some sql
commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C . It turns the searching pattern from UTF-8 to ASCII which speeds up by 140 time the original speed. I have a 26G file which would take me around 12 hours to do down to a couple of minutes. Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f
bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk
:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to the FNR (file-specific number of records) for the first file, where we populate the hash and then move onto the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0
). This is much slower, but unfortunately necessary (though we're at least matching plain strings without using regexes, so it could be slower). The loop stops when we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q
instead of the faster index($0, q)
. Note that this uses POSIX extended regular expressions, roughly the same as grep -E
or egrep
but without bounded quantifiers ({1,7}
) or the GNU extensions for word boundaries (\b
) and shorthand character classes (\s
,\w
, etc).
These should work as long as the hash doesn't exceed what awk
can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With