
Fast string search in a very large file

Tags: linux, grep, bash

What is the fastest way to find the lines of a file that contain a given string? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:

stringToSearch1
stringToSearch2
stringToSearch3

I have to search for all of these strings in a larger file, largeF (about 100 million lines). Any line of the larger file that contains one of the search strings should be printed.

The best method I have come up with so far is

grep -F -f smallF largeF

But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes. For over 50,000 search strings it will take a lot of time.

Is there a more efficient method?


asked Jun 08 '16 05:06

user262540

2 Answers

I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, as you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing:

Here is what I noticed in detail:

I have a 1.2 GB file filled with random strings.

    >ls -has | grep string
    1,2G strings.txt
    >head strings.txt
    Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
    Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
    BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
    etrulbGONKT3pact1SHg2ipcCr7TZ9jc
    .....
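A comparable test file can be produced along the following lines. This is only a minimal sketch: it assumes /dev/urandom is available, writes to a hypothetical temporary file (held in the variable sample) rather than to strings.txt, and uses an arbitrary size of 1000 lines instead of 1.2 GB.

```shell
# Sketch: build a sample file of random 32-character alphanumeric lines.
# "sample" is a hypothetical temporary file, not the strings.txt used above.
sample=$(mktemp)

# 262144 random bytes leave roughly 63,000 alphanumeric characters after
# filtering, which is well over the 1000 lines of 32 characters kept here.
head -c 262144 /dev/urandom \
    | LC_ALL=C tr -dc 'A-Za-z0-9' \
    | fold -w 32 \
    | sed -n '1,1000p' > "$sample"

# Show the number of lines and the first few lines.
wc -l < "$sample"
head -n 3 "$sample"
```

Increasing the byte count and the line range scales the file up for more realistic timings.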

Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:

First approach: grep without flags, searching for one string at a time:

    grep "ab" strings.txt > m1.out
    2,76s user 0,42s system 96% cpu 3,313 total
    grep "cd" strings.txt >> m1.out
    2,82s user 0,36s system 95% cpu 3,322 total
    grep "ef" strings.txt >> m1.out
    2,78s user 0,36s system 94% cpu 3,360 total

So in total the search takes nearly 10 seconds.

Second approach: grep with the -f flag, with the search strings in search.txt:

    >cat search.txt
    ab
    cd
    ef
    >grep -F -f search.txt strings.txt > m2.out
    31,55s user 0,60s system 99% cpu 32,343 total

For some reason this takes nearly 32 seconds.
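One thing that might be worth testing for the slow -f case is the locale, since multibyte (UTF-8) locale handling is a commonly reported cause of slowdowns in some grep versions. This was not measured here, so treat it as an assumption. The sketch below uses a tiny, made-up sample in a temporary directory and only checks that forcing the C locale with LC_ALL=C does not change the matches for ASCII data.

```shell
# Sketch: run the same -F -f search with the default locale and with the
# C locale, then confirm both runs find the same lines.
# The sample files are made up for illustration.
workdir=$(mktemp -d)
printf '%s\n' 'xxabxx' 'nothing' 'cdcd' 'zzefzz' 'plain' > "$workdir/strings.txt"
printf '%s\n' 'ab' 'cd' 'ef' > "$workdir/search.txt"

grep -F -f "$workdir/search.txt" "$workdir/strings.txt" > "$workdir/default.out"
LC_ALL=C grep -F -f "$workdir/search.txt" "$workdir/strings.txt" > "$workdir/clocale.out"

# cmp exits with status 0 only when the two files are byte-for-byte identical.
if cmp -s "$workdir/default.out" "$workdir/clocale.out"; then
    echo "outputs match"
else
    echo "outputs differ"
fi
```

On a large file, prefixing each grep call with time would show whether the locale makes a difference on a given system.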

Third approach: multiple search patterns, either as an alternation with -E

    grep -E "ab|cd|ef" strings.txt > m3.out
    3,80s user 0,36s system 98% cpu 4,220 total

or as several -e options

    grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
    3,86s user 0,38s system 98% cpu 4,323 total

The third method, using -E, took only 4.22 seconds to search through the file.
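To try the -E variant with patterns that live in a file, the lines can be joined into a single alternation. The following is a rough sketch on a made-up sample, not a tested solution for 50,000 patterns: it assumes the search strings contain no regular expression metacharacters, and a very long joined pattern may run into the system's argument length limit.

```shell
# Sketch: turn the lines of a pattern file into one extended regex and
# compare the result with the -F -f search on the same made-up data.
workdir=$(mktemp -d)
printf '%s\n' 'xxabxx' 'nothing' 'cdcd' 'zzefzz' 'plain' > "$workdir/strings.txt"
printf '%s\n' 'ab' 'cd' 'ef' > "$workdir/search.txt"

# paste -s joins all lines of the file, -d sets the separator character.
pattern=$(paste -s -d '|' "$workdir/search.txt")
echo "$pattern"

grep -E "$pattern" "$workdir/strings.txt" > "$workdir/joined.out"
grep -F -f "$workdir/search.txt" "$workdir/strings.txt" > "$workdir/fixed.out"

# Both searches should select the same lines for plain literal strings.
cmp -s "$workdir/joined.out" "$workdir/fixed.out" && echo "same matches"
```

If the search strings can contain characters such as . or *, they would need escaping before being joined, which this sketch does not attempt.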

Now let's check whether the results are the same:

    cat m1.out | sort | uniq > m1.sort
    cat m3.out | sort | uniq > m3.sort
    diff m1.sort m3.sort
    #

The diff produces no output, which means the found results are the same.
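The same check can be wrapped in a small helper. This is a sketch only: same_lines is a hypothetical function name, and the demo files are made up. It treats two result files as equal when they contain the same set of lines, ignoring order and duplicates, which is what the sort and uniq steps above achieve.

```shell
# Hypothetical helper: succeed when two files hold the same set of lines,
# regardless of line order and of repeated lines.
same_lines() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    sort -u "$1" > "$a_sorted"   # sort -u sorts and drops duplicate lines
    sort -u "$2" > "$b_sorted"
    cmp -s "$a_sorted" "$b_sorted"
    status=$?
    rm -f "$a_sorted" "$b_sorted"
    return "$status"
}

# Made-up result files: same lines, different order, one duplicate.
demo=$(mktemp -d)
printf '%s\n' 'cdcd' 'xxabxx' 'xxabxx' > "$demo/m1.out"
printf '%s\n' 'xxabxx' 'cdcd' > "$demo/m3.out"

if same_lines "$demo/m1.out" "$demo/m3.out"; then
    echo "same matches"
else
    echo "results differ"
fi
```

The duplicate handling matters for the first approach, because a line that contains both "ab" and "cd" is appended to m1.out twice.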

You may want to give this a try. Otherwise, I would advise you to look at the thread "Fastest possible grep"; see the comment from Cyrus.


answered Oct 15 '22 00:10

cb0

You may want to try sift or ag. Sift in particular lists some pretty impressive benchmarks versus grep.

answered Oct 15 '22 01:10

ajfabbri
