What is the fastest method for searching lines in a file containing a string?
I have a file containing strings to search for. This small file (smallF) contains about 50,000 lines and looks like:
stringToSearch1
stringToSearch2
stringToSearch3
I have to search for all of these strings in a larger file (about 100 million lines). If any line in this larger file contains one of the search strings, the line should be printed.
The best method I have come up with so far is
grep -F -f smallF largeF
But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes. For over 50,000 search strings it will take a lot of time.
Is there a more efficient method?
Use the findWithinHorizon method of Java's Scanner class. Scanner internally opens a FileChannel to read the file, and for pattern matching it ends up using a Boyer-Moore algorithm for efficient string searching.
I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, as you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing:
Here is what I noticed in detail:
I have a 1.2 GB file filled with random strings.
>ls -has | grep string
1,2G strings.txt
>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:
grep "ab" strings.txt > m1.out
2,76s user 0,42s system 96% cpu 3,313 total
grep "cd" strings.txt >> m1.out
2,82s user 0,36s system 95% cpu 3,322 total
grep "ef" strings.txt >> m1.out
2,78s user 0,36s system 94% cpu 3,360 total
So in total the search takes nearly 10 seconds.
Using grep with the -f flag, with the search strings in search.txt:
>cat search.txt
ab
cd
ef
>grep -F -f search.txt strings.txt > m2.out
31,55s user 0,60s system 99% cpu 32,343 total
For some reason this takes about 32 seconds.
Now using multiple search patterns with -E or -e:
grep -E "ab|cd|ef" strings.txt > m3.out
3,80s user 0,36s system 98% cpu 4,220 total
or
grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
3,86s user 0,38s system 98% cpu 4,323 total
The third method, using -E, took only 4.22 seconds to search through the file.
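If the search strings already live in a file, the -E alternation can be built from that file instead of being typed by hand. The following is only a minimal sketch on tiny made-up sample data: it assumes the patterns contain no regex metacharacters (with -E they would be interpreted as regex syntax rather than literally) and that the joined pattern is short enough to fit on a command line, which may not hold for 50,000 strings.

```shell
#!/bin/sh
# Hypothetical sample data; the file names mirror the ones used in this answer.
workdir=$(mktemp -d)
printf '%s\n' ab cd ef > "$workdir/search.txt"
printf '%s\n' xxabxx yyyyyy zzcdzz qqefqq zzzzzz > "$workdir/strings.txt"

# Join the pattern lines with '|' to form one extended regex: ab|cd|ef
pattern=$(paste -sd'|' "$workdir/search.txt")

# Assumption: the patterns contain no regex metacharacters.
matches=$(grep -E "$pattern" "$workdir/strings.txt")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Whether this is faster than -f for a given grep version and pattern count is something to measure rather than assume.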
Now let's check if the results are the same:
cat m1.out | sort | uniq > m1.sort
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort #
The diff produces no output, which means the found results are the same.
You may want to give it a try; otherwise I would advise you to look at the thread "Fastest possible grep" and see the comment from Cyrus.
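One tweak that is often suggested for speeding up grep is to force the C locale, so that input is handled as plain bytes instead of multibyte characters. The sketch below applies it to the original -F -f command on tiny made-up stand-ins for smallF and largeF. It only shows the invocation; it is not a benchmark, and any gain depends on the grep version and the data.

```shell
#!/bin/sh
# Hypothetical sample data standing in for smallF and largeF.
workdir=$(mktemp -d)
printf '%s\n' stringToSearch1 stringToSearch2 > "$workdir/smallF"
printf '%s\n' 'aaa stringToSearch1 bbb' 'unrelated line' 'stringToSearch2' > "$workdir/largeF"

# LC_ALL=C selects the C locale for this one command only.
# -F treats each pattern as a fixed string; -f reads the patterns from a file.
result=$(LC_ALL=C grep -F -f "$workdir/smallF" "$workdir/largeF")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Timing the command with and without the LC_ALL=C prefix on the real files would show whether it makes a difference.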
You may want to try sift or ag. Sift in particular lists some pretty impressive benchmarks versus grep.