Fast string search in a very large file

Tags: linux, grep, bash

What is the fastest way to find the lines of a file that contain a given string? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:

stringToSearch1
stringToSearch2
stringToSearch3

I have to search for all of these strings in a larger file, largeF (about 100 million lines). If a line in the larger file contains any of the search strings, that line is printed.

The best method I have come up with so far is:

grep -F -f smallF largeF 

But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes. For over 50,000 search strings it will take a lot of time.

Is there a more efficient method?
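For anyone who wants to reproduce the setup at a small scale, here is a minimal sketch. The file contents are made-up stand-ins for the real smallF and largeF, and the scratch directory is only there so that nothing existing gets overwritten.

```shell
#!/usr/bin/env bash
# Work in a scratch directory (assumption: any empty directory would do).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# smallF: one fixed search string per line (made-up sample data).
printf '%s\n' 'stringToSearch1' 'stringToSearch3' > smallF

# largeF: tiny stand-in for the 100-million-line file.
printf '%s\n' 'xx stringToSearch1 yy' 'no match here' 'stringToSearch3' > largeF

# -F treats every pattern as a literal string; -f reads the patterns from smallF.
matches=$(grep -F -f smallF largeF)
printf '%s\n' "$matches"
```

Only the first and third lines of the sample largeF contain one of the search strings, so only those two are printed.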

user262540 asked Jun 08 '16 05:06



2 Answers

I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not apply to your problem, since you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing.

Here is what I noticed in detail:

I have a 1.2 GB file filled with random strings.

>ls -has | grep string
1,2G strings.txt

>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
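The answer does not say how strings.txt was generated. As an assumption, a file with the same shape (32 random alphanumeric characters per line) could be produced roughly like this; `num_lines` is a hypothetical parameter, kept small here so the sketch runs quickly.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so an existing strings.txt is not overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical size; raise it to approach the 1.2 GB file from the answer.
num_lines=1000

# tr keeps only alphanumeric bytes from the random stream (LC_ALL=C makes it
# operate on raw bytes), fold cuts the stream into 32-character lines, and
# head stops the pipeline after num_lines lines.
LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | fold -w 32 | head -n "$num_lines" > strings.txt

wc -l < strings.txt
```

Each 32-character line takes 33 bytes on disk, so a 1.2 GB file would need on the order of 36 million lines.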

Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:

  1. Using grep without flags, search one at a time:

    grep "ab" strings.txt > m1.out
      2,76s user 0,42s system 96% cpu 3,313 total

    grep "cd" strings.txt >> m1.out
      2,82s user 0,36s system 95% cpu 3,322 total

    grep "ef" strings.txt >> m1.out
      2,78s user 0,36s system 94% cpu 3,360 total

So in total the search takes nearly 10 seconds.

  2. Using grep with the -f flag, with the search strings in search.txt:

    >cat search.txt
    ab
    cd
    ef

    >grep -F -f search.txt strings.txt > m2.out
      31,55s user 0,60s system 99% cpu 32,343 total

For some reason this takes about 32 seconds.

  3. Now using multiple search patterns with -E (or with repeated -e):

    grep -E "ab|cd|ef" strings.txt > m3.out
      3,80s user 0,36s system 98% cpu 4,220 total

    or

    grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
      3,86s user 0,38s system 98% cpu 4,323 total

The third method, using -E, took only 4.22 seconds to search through the file.
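With thousands of search strings, the pattern would have to be built from the file rather than typed by hand. A minimal sketch, assuming the strings contain no regular-expression metacharacters and using made-up sample data (the name m3b.out is hypothetical):

```shell
#!/usr/bin/env bash
# Work in a scratch directory with small made-up sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'ab' 'cd' 'ef' > search.txt
printf '%s\n' 'xxabxx' 'nothing' 'cdcd' 'zzz' 'ef' > strings.txt

# Variant 1: join the lines of search.txt with '|' to get "ab|cd|ef".
# Only safe if the strings contain no regex metacharacters.
pattern=$(paste -s -d '|' search.txt)
grep -E "$pattern" strings.txt > m3.out

# Variant 2: turn every line of search.txt into its own -e argument.
args=()
while IFS= read -r p; do
    args+=(-e "$p")
done < search.txt
grep "${args[@]}" strings.txt > m3b.out

cat m3.out
```

Both variants put every pattern on the command line, so with 50,000 strings they may exceed the system's argument-length limit (`getconf ARG_MAX` reports it). Whether they stay faster than -f at that scale is something to measure, not assume.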

Now let's check whether the results are the same:

cat m1.out | sort | uniq > m1.sort
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort #

The diff produces no output, which means both approaches found the same lines.
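The sort and uniq step matters because the one-at-a-time approach writes a line once per pattern it matches, while a single grep call prints each matching line only once. A small sketch with made-up data that shows the difference and then does the same comparison using `sort -u` and `cmp`:

```shell
#!/usr/bin/env bash
# Work in a scratch directory with a tiny made-up input file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'abcd' 'xxefxx' 'nothing' > strings.txt

# One pattern at a time: 'abcd' matches both "ab" and "cd",
# so it ends up in m1.out twice.
grep "ab" strings.txt > m1.out
grep "cd" strings.txt >> m1.out
grep "ef" strings.txt >> m1.out

# All patterns at once: each matching line is printed once.
grep -E "ab|cd|ef" strings.txt > m3.out

# sort -u sorts and removes duplicates in one step.
sort -u m1.out > m1.sort
sort -u m3.out > m3.sort

# cmp -s is silent and reports equality through its exit status.
if cmp -s m1.sort m3.sort; then
    result=same
else
    result=different
fi
echo "$result"
```

Here m1.out has three lines and m3.out has two, but after deduplication both contain the same two lines.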

You may want to give it a try. Otherwise, I would advise you to look at the thread "Fastest possible grep" (see the comment from Cyrus).

cb0 answered Oct 15 '22 00:10


You may want to try sift or ag. Sift in particular lists some pretty impressive benchmarks versus grep.

ajfabbri answered Oct 15 '22 01:10