 

grep -vf too slow with large files

I am trying to filter data out of data.txt using patterns stored in a file filter.txt, like below:

grep -v -f filter.txt data.txt > op.txt

This grep takes 10-15 minutes for 30-40K lines in filter.txt and ~300K lines in data.txt.

Is there any way to speed this up?

data.txt

data1
data2
data3

filter.txt

data1

op.txt

data2
data3

This works with the solution provided by codeforester, but fails when filter.txt is empty.
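
For reference, one common workaround for the empty-filter case (a sketch, not the only option) is to test the file name instead of relying on the FNR==NR line-count trick, which misfires when awk reads zero lines from the first file:

awk 'FILENAME == ARGV[1] {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt

With an empty filter.txt, no line ever satisfies the first block, so every line of data.txt is printed, which is the expected output.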

asked Mar 09 '17 by user3150037


People also ask

Why is grep taking so long?

If you're running grep over a very large number of files, it will be slow because it has to open and read through every one of them. If you have some idea of where the match might be, limit the number of files grep has to search.
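
For example, with GNU grep you can restrict a recursive search to matching file names (the path, glob, and pattern here are illustrative):

grep -r --include='*.log' 'ERROR' /var/log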

Does grep have a file size limit?

Though grep expects to do the matching on text, it has no limits on input line length other than available memory, and it can match arbitrary characters within a line.

Is there anything faster than grep?

The grep utility searches text files for regular expressions, but it can also search for ordinary strings, since a string is a special case of a regular expression. However, if your patterns really are plain text strings, fgrep (equivalent to grep -F) may be much faster than grep.
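
Applied to the question above, that could look like the sketch below, assuming the lines in filter.txt are meant to match whole lines literally (no regex metacharacters):

grep -vxFf filter.txt data.txt > op.txt

Here -F treats each pattern as a fixed string instead of a regex, and -x matches whole lines only, which avoids compiling and running tens of thousands of regular expressions.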

Is awk faster than grep?

When you are only searching for strings and speed matters, you should almost always use grep; it is orders of magnitude faster than awk for plain searching.
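
A quick, unscientific way to compare the two on your own data (file name and pattern are illustrative):

time grep -c 'data1' data.txt
time awk '/data1/ {n++} END {print n}' data.txt

Both commands count matching lines, so the timings are directly comparable.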


1 Answer

Based on Inian's solution in the related post, this awk command should solve your issue:

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
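
For readers new to the idiom, here is the same logic spread out with comments (behavior is identical to the one-liner):

awk '
    FNR == NR {       # FNR == NR only while reading the first file, filter.txt
        hash[$0]      # store each filter line as an array key
        next          # do not fall through to the second block
    }
    !($0 in hash)     # now reading data.txt: print lines absent from filter.txt
' filter.txt data.txt > op.txt

Note the caveat raised in the question: when filter.txt is empty, awk reads zero lines from it, so FNR == NR also holds while reading data.txt and nothing is printed; see the FILENAME-based variant sketched above.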
answered Sep 28 '22 by codeforester