I have two files, <code>file1.txt</code> and <code>file2.txt</code>. <code>file1.txt</code> has about 14K lines and <code>file2.txt</code> has about 2 billion lines. <code>file1.txt</code> has a single field <code>f1</code> per line while <code>file2.txt</code> has 3 fields, <code>f1</code> through <code>f3</code>, delimited by <code>|</code>. I want to find all lines from <code>file2.txt</code> where <code>f1</code> of <code>file1.txt</code> matches <code>f2</code> of <code>file2.txt</code> (or anywhere on the line if we don't want to spend extra time splitting the values of <code>file2.txt</code>). file1.txt (about 14K lines, not sorted): <pre class="prettyprint"><code>foo1
foo2
...
bar1
bar2
...
</code></pre> file2.txt (about 2 billion lines, not sorted): <pre class="prettyprint"><code>date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
</code></pre> Output expected: <pre class="prettyprint"><code>date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
</code></pre> Here is what I have tried, and it seems to be taking several hours to run: <pre class="prettyprint"><code>fgrep -F -f file1.txt file2.txt > file.matched
</code></pre> I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.

I have tried to do a comparison between some of the methods presented here. First I created a Perl script to generate the input files <code>file1.txt</code> and <code>file2.txt</code>. In order to compare some of the solutions, I made sure that the words from <code>file1.txt</code> could only appear in the second field of <code>file2.txt</code>. Also, to be able to use the <code>join</code> solution presented by @GeorgeVasiliou, I sorted <code>file1.txt</code> and <code>file2.txt</code>. Currently I generate the input files based on only 75 random words (taken from https://www.randomlists.com/random-words). Only 5 of these 75 words were used in <code>file1.txt</code>; the remaining 70 words were used to fill up the fields in <code>file2.txt</code>. It might be necessary to increase the number of words substantially to get realistic results (according to the OP, the original <code>file1.txt</code> contained 14000 words). In the tests below I used a <code>file2.txt</code> with 1000000 (1 million) lines. The script also generates the file <code>regexp1.txt</code> required by the grep solution of @BOC. gen_input_files.pl: <pre class="prettyprint"><code>#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;

use Data::Printer;
use Getopt::Long;

GetOptions ("num_lines=i" => \my $nlines )
  or die("Error in command line arguments\n");

# Generated random words from site: https://www.randomlists.com/random-words
my $word_filename        = 'words.txt'; # 75 random words
my $num_match_words      = 5;
my $num_file2_lines      = $nlines || 1_000_000;
my $file2_words_per_line = 3;
my $file2_match_field_no = 2;
my $file1_filename       = 'file1.txt';
my $file2_filename       = 'file2.txt';
my $file1_regex_fn       = 'regexp1.txt';

say "generating $num_file2_lines lines..";
my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );

write_file1( $file1_filename, $words2 );
write_file2(
    $file2_filename, $words1, $words2, $num_file2_lines,
    $file2_words_per_line, $file2_match_field_no
);
write_BOC_regexp_file( $file1_regex_fn, $words2 );


sub write_BOC_regexp_file {
    my ( $fn, $words ) = @_;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh '\\|' . (join "|", @$words) . '\\|';
    close $fh;
}

sub write_file2 {
    my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;

    my $nwords1 = scalar @$words1;
    my $nwords2 = scalar @$words2;
    my @lines;
    for (1..$nlines) {
        my @words_line;
        my $key;
        for (1..$words_per_line) {
            my $word;
            if ( $_ != $field_no ) {
                my $index = int (rand $nwords1);
                $word = @{ $words1 }[$index];
            }
            else {
                my $index = int (rand($nwords1 + $nwords2) );
                if ( $index < $nwords2 ) {
                    $word = @{ $words2 }[$index];
                }
                else {
                    $word =  @{ $words1 }[$index - $nwords2];
                }
                $key = $word;
            }
            push @words_line, $word;
        }
        push @lines, [$key, (join "|", @words_line)];
    }
    @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh (join "\n", @lines);
    close $fh;
}

sub write_file1 {
    my ( $fn, $words ) = @_;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh (join "\n", sort @$words);
    close $fh;
}

sub get_words {
    my ( $fn, $N ) = @_;

    open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    my @words = map {chomp $_; $_} <$fh>;
    close $fh;

    my @words1 = @words[$N..$#words];
    my @words2 = @words[0..($N - 1)];
    return ( \@words1, \@words2 );
}
</code></pre> Next, I created a subfolder <code>solutions</code> with all the test cases: <pre class="prettyprint"><code>$ tree solutions/
solutions/
├── BOC1
│   ├── out.txt
│   └── run.sh
├── BOC2
│   ├── out.txt
│   └── run.sh
├── codeforester
│   ├── out.txt
│   ├── run.pl
│   └── run.sh
[...]
</code></pre> Here, each <code>out.txt</code> file holds the output produced by the corresponding solution, and each <code>run.sh</code> script runs the solution for the given test case.
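The generation scheme described above can be condensed into a few lines of Python. This is only a sketch of the idea, not a port of <code>gen_input_files.pl</code>: the function name <code>make_file2_lines</code>, its parameters, and the fixed seed are assumptions made for illustration, and the sketch skips the sorting and file output. It shows the key property of the test data, namely that words from the match pool can only land in field 2.

```python
import random


def make_file2_lines(match_words, filler_words, n_lines, seed=0):
    """Build n_lines of 'f1|f2|f3' where only f2 may hold a match word.

    Hypothetical helper for illustration; names and seed are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    field2_pool = list(match_words) + list(filler_words)
    lines = []
    for _ in range(n_lines):
        f1 = rng.choice(filler_words)  # never a match word
        f2 = rng.choice(field2_pool)   # match word or filler word
        f3 = rng.choice(filler_words)  # never a match word
        lines.append("|".join((f1, f2, f3)))
    return lines


if __name__ == "__main__":
    demo = make_file2_lines(["foo1", "foo2"], ["date1", "date2", "number1"], 5)
    print("\n".join(demo))
```

Because field 2 is drawn uniformly from both pools, the expected share of matching lines is the number of match words divided by the total number of words.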
<h3>Notes on the different solutions</h3> <ul> <li> <code>BOC1</code> : First solution presented by @BOC <pre class="prettyprint"><code>grep -E -f regexp1.txt file2.txt </code></pre> </li> <li> <code>BOC2</code> : Second solution suggested by @BOC: <pre class="prettyprint"><code>LC_ALL=C grep -E -f regexp1.txt file2.txt </code></pre> </li> <li><code>codeforester</code> : Accepted Perl solution by @codeforester ( see source )</li> <li> <code>codeforester_orig</code> : Original solution presented by @codeforester: <pre class="prettyprint"><code>fgrep -f file1.txt file2.txt </code></pre> </li> <li><code>dawg</code> : Python solution using dictionary and split line proposed by @dawg ( see source )</li> <li> <code>gregory1</code> : Solution using GNU Parallel suggested by @gregory <pre class="prettyprint"><code>parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt </code></pre> See note below regarding how to choose <code>$block_size</code>. </li> <li><code>hakon1</code> : Perl solution provided by @HåkonHægland (see source). This solution requires compilation of the C extension the first time the code is run. It does not require recompilation when <code>file1.txt</code> or <code>file2.txt</code> changes. Note: The time used to compile the C extension at the initial run is not included in the run times presented below.</li> <li> <code>ikegami</code> : Solution using assembled regexp and using <code>grep -P</code> as given by @ikegami. Note: The assembled regexp was written to a separate file <code>regexp_ikegami.txt</code>, so the runtime of generating the regexp is not included in the comparison below.
This is the code used: <pre class="prettyprint"><code>regexp=$(< "regexp_ikegami.txt")
grep -P "$regexp" file2.txt
</code></pre> </li> <li> <code>inian1</code> : First solution by @Inian using <code>match()</code> <pre class="prettyprint"><code>awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt </code></pre> </li> <li> <code>inian2</code> : Second solution by @Inian using <code>index()</code> <pre class="prettyprint"><code>awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (index($0,i)) {print; break} }' file1.txt FS='|' file2.txt </code></pre> </li> <li> <code>inian3</code> : Third solution by @Inian checking only the <code>$2</code> field: <pre class="prettyprint"><code>awk 'FNR==NR{ hash[$1]; next } $2 in hash' file1.txt FS='|' file2.txt </code></pre> </li> <li> <code>inian4</code> : 4th solution by @Inian ( basically the same as <code>codeforester_orig</code> with <code>LC_ALL</code> ) : <pre class="prettyprint"><code>LC_ALL=C fgrep -f file1.txt file2.txt </code></pre> </li> <li> <code>inian5</code> : 5th solution by @Inian (same as <code>inian1</code> but with <code>LC_ALL</code> ): <pre class="prettyprint"><code>LC_ALL=C awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt </code></pre> </li> <li><code>inian6</code> : Same as <code>inian3</code> but with <code>LC_ALL=C</code>. Thanks to @GeorgeVasiliou for the suggestion.</li> <li><code>jjoao</code> : Compiled flex-generated C code as proposed by @JJoao (see source ). Note: The executable must be recompiled each time <code>file1.txt</code> changes.
The time used to compile the executable is not included in the run times presented below.</li> <li><code>oliv</code> : Python script provided by @oliv ( see source )</li> <li> <code>Vasiliou</code> : Using <code>join</code> as suggested by @GeorgeVasiliou: <pre class="prettyprint"><code>join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt </code></pre> </li> <li><code>Vasiliou2</code> : Same as <code>Vasiliou</code> but with <code>LC_ALL=C</code>.</li> <li><code>zdim</code> : Using Perl script provided by @zdim ( see source ). Note: This uses the regexp search version ( instead of split line solution ).</li> <li><code>zdim2</code> : Same as <code>zdim</code> except that it uses the <code>split</code> function instead of regexp search for the field in <code>file2.txt</code>.</li> </ul> <h3>Notes</h3> <ol> <li>I experimented a little bit with GNU parallel (see <code>gregory1</code> solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (<code>file2.txt</code>) into 4 equal-sized chunks and run a single job on each of the 4 processors. More testing might be needed here. So for the first test case where <code>file2.txt</code> is 20M, I set <code>$block_size</code> to 5M ( see <code>gregory1</code> solution above), whereas for the more realistic case presented below where <code>file2.txt</code> is 268M, a <code>$block_size</code> of 67M was used.</li> <li> The solutions <code>BOC1</code>, <code>BOC2</code>, <code>codeforester_orig</code>, <code>inian1</code>, <code>inian4</code>, <code>inian5</code>, and <code>gregory1</code> all used loose matching, meaning that the words from <code>file1.txt</code> did not have to match exactly in field #2 of <code>file2.txt</code>; a match anywhere on the line was accepted. Since this behavior made it more difficult to compare them with the other methods, some modified methods were also introduced.
The first two methods, called <code>BOC1B</code> and <code>BOC2B</code>, used a modified <code>regexp1.txt</code> file. The lines in the original <code>regexp1.txt</code> were of the form <code>\|foo1|foo2|...|fooN\|</code>, which would match the words at any field boundary. The modified file, <code>regexp1b.txt</code>, anchored the match to field #2 exclusively using the form <code>^[^|]*\|foo1|foo2|...|fooN\|</code> instead. The rest of the modified methods, <code>codeforester_origB</code>, <code>inian1B</code>, <code>inian4B</code>, <code>inian5B</code>, and <code>gregory1B</code>, used a modified <code>file1.txt</code>. Instead of a literal word per line, the modified file <code>file1b.txt</code> used one regex per line of the form: <pre class="prettyprint"><code>^[^|]*\|word1\|
^[^|]*\|word2\|
^[^|]*\|word3\|
[...]
</code></pre> and in addition, <code>fgrep -f</code> was replaced by <code>grep -E -f</code> for these methods. </li> </ol> <h3>Running the tests</h3> Here is the script used for running all the tests. It uses the Bash <code>time</code> command to record the time spent for each script. Note that the <code>time</code> command returns three different times called <code>real</code>, <code>user</code>, and <code>sys</code>. At first I used <code>user</code> + <code>sys</code>, but realized that this was incorrect when using the GNU parallel command, so the time reported below is now the <code>real</code> part returned by <code>time</code>. See this question for more information about the different times returned by <code>time</code>. The first test is run with <code>file1.txt</code> containing 5 lines, and <code>file2.txt</code> containing <code>1000000</code> lines. Here are the first 52 lines of the <code>run_all.pl</code> script; the rest of the script is available here. run_all.pl <pre class="prettyprint"><code>#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Cwd;
use Getopt::Long;
use Data::Printer;
use FGB::Common;
use List::Util qw(max shuffle);
use Number::Bytes::Human qw(format_bytes);
use Sys::Info;

GetOptions (
    "verbose"       => \my $verbose,
    "check"         => \my $check,
    "single-case=s" => \my $case,
    "expected=i"    => \my $expected_no_lines,
) or die("Error in command line arguments\n");

my $test_dir    = 'solutions';
my $output_file = 'out.txt';
my $wc_expected = $expected_no_lines; # expected number of output lines

my $tests       = get_test_names( $test_dir, $case );

my $file2_size  = get_file2_size();
my $num_cpus    = Sys::Info->new()->device( CPU => () )->count;

chdir $test_dir;
my $cmd = 'run.sh';
my @times;
for my $case (@$tests) {
    my $savedir = getcwd();
    chdir $case;
    say "Running '$case'..";
    my $arg = get_cmd_args( $case, $file2_size, $num_cpus );
    my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`;
    my ($user, $sys, $real ) = get_run_times( $output );
    print_timings( $user, $sys, $real ) if $verbose;
    check_output_is_ok( $output_file, $wc_expected, $verbose, $check );
    print "\n" if $verbose;
    push @times, $real;
    #push @times, $user + $sys; # this is wrong when using Gnu parallel
    chdir $savedir;
}

say "Done.\n";

print_summary( $tests, \@times );
</code></pre> <h3>Results</h3> Here is the output from running the tests: <pre class="prettyprint"><code>$ run_all.pl --verbose
Running 'inian3'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711

Running 'inian2'..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711

Running 'Vasiliou'..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester_orig'..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711

[...]
</code></pre> <h3>Summary</h3> [Results obtained by @Vasiliou are shown in the middle column.] <pre class="prettyprint"><code>                               |Vasiliou
My Benchmark                   |Results  |   Details
-------------------------------|---------|----------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose]
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose]
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose]
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict]
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict]
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]]
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]]
ikegami            : 0.1s      |         | grep -P
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]]
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict]
jjoao              : 0.21s     |         | [compiled flex generated c code]
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict]
codeforester_origB : 0.28s     |         | grep -E -f [strict]
dawg               : 0.35s     |         | [python + split+dict]
inian3             : 0.44s     |1.1s     | [awk + split+dict]
zdim2              : 0.4s      |         | [perl + split+dict]
codeforester       : 0.45s     |         | [perl + split+dict]
oliv               : 0.5s      |         | [python + compiled regex + re.search()]
zdim               : 0.61s     |         | [perl + regexp+dict]
inian2             : 0.73s     |1.7s     | [awk + index($0,i)]
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]]
inian1             : 19.46s    |         | [awk + match($0,i) [loose]]
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]]
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]]

Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM-Debian Testing 64bit- kernel 4.9.0.1 - no cpu freq scaling.
</code></pre> <h3>A more realistic test case</h3> I then created a more realistic case with <code>file1.txt</code> having 100 words and <code>file2.txt</code> having 10 million lines (268 MB file size).
I extracted 1000 random words from the dictionary at <code>/usr/share/dict/american-english</code> using <code>shuf -n1000 /usr/share/dict/american-english > words.txt</code>, then extracted 100 of these words into <code>file1.txt</code>, and then constructed <code>file2.txt</code> the same way as described above for the first test case. Note that the dictionary file was UTF-8 encoded, and I stripped away all non-ASCII characters from <code>words.txt</code>. Then I ran the tests without the three slowest methods from the previous case, i.e. <code>inian1</code>, <code>inian2</code>, and <code>inian5</code> were left out. Here are the new results: <pre class="prettyprint"><code>gregory1           : 0.86s     | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s     | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s     | LC_ALL grep -E -f [strict]
BOC2B              : 1.13s     | LC_ALL grep -E [strict]
BOC2               : 1.15s     | LC_ALL grep -E [loose]
BOC1               : 1.18s     | grep -E [loose]
ikegami            : 1.33s     | grep -P
Vasiliou           : 1.37s     | [join [requires sorted files]]
hakon1             : 1.44s     | [perl + c]
inian4             : 2.18s     | LC_ALL fgrep -f [loose]
codeforester_orig  : 2.2s      | fgrep -f [loose]
inian6             : 2.82s     | [LC_ALL awk + split+dict]
jjoao              : 3.09s     | [compiled flex generated c code]
dawg               : 3.54s     | [python + split+dict]
zdim2              : 4.21s     | [perl + split+dict]
codeforester       : 4.67s     | [perl + split+dict]
inian3             : 5.52s     | [awk + split+dict]
zdim               : 6.55s     | [perl + regexp+dict]
gregory1B          : 45.36s    | [parallel + grep -E -f [strict]]
oliv               : 60.35s    | [python + compiled regex + re.search()]
BOC1B              : 74.71s    | grep -E [strict]
codeforester_origB : 75.52s    | grep -E -f [strict]
</code></pre> <h3>Note</h3> The <code>grep</code>-based solutions were looking for a match on the whole line, so in this case they produced some false matches: the methods <code>codeforester_orig</code>, <code>BOC1</code>, <code>BOC2</code>, <code>gregory1</code>, <code>inian4</code>, and <code>oliv</code> extracted 1,087,609 lines out of 10,000,000 lines, whereas the other methods extracted the correct 997,993 lines from <code>file2.txt</code>. <h3>Notes</h3> <ul> <li>I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)</li> <li>The whole benchmark study is available here.</li> </ul>
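The gap between the loose and strict line counts noted above presumably comes down to substring matching anywhere on the line versus an exact comparison on field 2. The sketch below illustrates the difference on a few made-up lines. The function names and sample data are assumptions for illustration and are not taken from any of the benchmarked solutions.

```python
def loose_match(words, lines):
    """Keep a line if any word occurs anywhere in it (like fgrep -f)."""
    return [ln for ln in lines if any(w in ln for w in words)]


def strict_match(words, lines, sep="|"):
    """Keep a line only if its second field equals one of the words."""
    wanted = set(words)
    kept = []
    for ln in lines:
        fields = ln.split(sep)
        if len(fields) > 1 and fields[1] in wanted:
            kept.append(ln)
    return kept


if __name__ == "__main__":
    words = ["cat"]
    lines = [
        "date1|cat|number1",      # exact hit in field 2
        "date2|dog|number2",      # no hit
        "category|dog|number3",   # substring hit outside field 2
        "date4|catalog|number4",  # substring hit inside field 2, not exact
    ]
    print(len(loose_match(words, lines)), "loose matches")
    print(len(strict_match(words, lines)), "strict matches")
```

With this sample data the loose variant keeps three lines and the strict variant keeps one, which mirrors how dictionary words that are substrings of other words can inflate the loose counts.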

A Perl solution. [See Note below.] Use a hash for the first file. As you read the big file line-by-line, extract the field by regex (captures the first pattern between <code>||</code>) or <code>split</code> (gets the second word) and print if it <code>exists</code>. They likely differ in speed a bit (time them). The <code>defined</code> check isn't needed in the regex, while for <code>split</code> use <code>//</code> (defined-or), which short-circuits. <pre class="prettyprint"><code>use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile\n"  if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;

open    $fh, '<', $bigfile or die "Can't open $bigfile: $!";
while (<$fh>)
{
    exists $word{ (/\|([^|]+)/)[0] } && print;

    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;
</code></pre> Avoiding the <code>if</code> branch and using short-circuit evaluation is faster, but only very slightly. On billions of lines these tweaks add up, but again not to very much. It may (or may not) be a tad faster to read the small file line by line, instead of in list context as above, but this should not be noticeable. Update: Writing to <code>STDOUT</code> saves two operations, and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools, so I changed to write to <code>STDOUT</code>. Next, the <code>exists</code> test is not needed and dropping it spares an operation. However, I consistently get slightly better runtimes with it, while it also conveys the purpose better. Altogether I am leaving it in. Thanks to ikegami for comments. Note: The commented-out version is about 50% faster than the other, by my benchmark below. These are both given because they are different, one finding the first match and the other the second field.
I am keeping it this way as a more generic choice, since the question is ambiguous on that. <hr> Some comparisons (benchmark) [Updated for writing to <code>STDOUT</code>, see "Update" above] There is an extensive analysis in the answer by HåkonHægland, timing one run of most solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted <code>fgrep</code> one, expected to be fast and used in the question and in many answers. I build test data in the following way. A handful of lines of roughly the length shown are made with random words, for both files, so as to match in the second field. Then I pad this "seed" for data samples with lines that won't match, so as to mimic the ratios between sizes and matches quoted by the OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. Then these samples are written repeatedly to build full data files like the OP's, List::Util::<code>shuffle</code>-ed each time. All runs compared below produce <code>106_120</code> matches for the above file sizes (<code>diff</code>-ed for a check), so the matching frequency is close enough. They are benchmarked by calling complete programs using <code>my $res = timethese(60 ...)</code>. The results of <code>cmpthese($res)</code> on v5.16 are <pre class="prettyprint">        Rate regex  cfor split fgrep
regex 1.05/s    --  -23%  -35%  -44%
cfor  1.36/s   30%    --  -16%  -28%
split 1.62/s   54%   19%    --  -14%
fgrep 1.89/s   80%   39%   17%    --
</pre> The fact that the optimized C program <code>fgrep</code> comes out on top isn't surprising. The lag of "regex" behind "split" may be due to the overhead of starting the regex engine many times for small matches. This may vary over Perl versions, given the evolving regex engine optimizations.
I include the answer of @codeforester ("cfor") since it was claimed to be the fastest, and its <code>24%</code> lag behind the very similar "split" is likely due to scattered small inefficiencies (see a comment below this answer).&dagger; This isn't shatteringly different, while there are certainly variations across hardware and software and over data details. I ran this on different Perls and machines, and the notable difference is that in some cases <code>fgrep</code> was indeed an order of magnitude faster. The OP's experience of very slow <code>fgrep</code> is surprising. Given their quoted run times, an order of magnitude slower than the above, I'd guess that there's an old system to "blame." Even though this is completely I/O based, there are concurrency benefits from putting it on multiple cores and I'd expect a good speedup, up to a factor of a few. <hr> &dagger; Alas, the comment got deleted (?). In short: unneeded use of a scalar (costs), of an <code>if</code> branch, of <code>defined</code>, of <code>printf</code> instead of <code>print</code> (slow!). These matter for runtime on 2 billion lines.
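For readers who prefer Python, the hash-lookup idea from the top of this answer can be sketched as follows. This is an illustrative translation of the <code>split</code> variant, not the benchmarked Perl: the function name <code>filter_big_file</code> and its parameters are assumptions, and no timing claims are made for it.

```python
import sys


def filter_big_file(small_path, big_path, out=sys.stdout, sep="|"):
    """Write lines of big_path whose second field appears in small_path.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    with open(small_path, encoding="utf-8") as fh:
        wanted = {line.rstrip("\n") for line in fh}  # one key per line

    written = 0
    with open(big_path, encoding="utf-8") as fh:
        for line in fh:  # streams the file; never loads it whole
            # maxsplit=2 stops splitting once field 2 is isolated
            fields = line.rstrip("\n").split(sep, 2)
            if len(fields) > 1 and fields[1] in wanted:
                out.write(line)
                written += 1
    return written


# Command-line use: python script.py smallfile bigfile
if __name__ == "__main__" and len(sys.argv) == 3:
    filter_big_file(sys.argv[1], sys.argv[2])
```

The set plays the role of the Perl hash, and the length check on the split result stands in for the defined-or guard on lines that have no second field.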

Fastest way to find lines of a file from another larger file in Bash

Tags: linux, grep, bash, awk, perl



asked Feb 15 '17 01:02

codeforester

2 Answers

I have tried to do a comparison between some of the methods presented here.

First I created a Perl script to generate the input files file1.txt and file2.txt. In order to compare some of the solutions, I made sure that the the words from file1.txt only could appear in the second field in file2.txt. Also to be able to use the join solution presented by @GeorgeVasiliou, I sorted file1.txt and file2.txt. Currently I generated the input files based on only 75 random words ( taken from https://www.randomlists.com/random-words ). Only 5 of these 75 words was used in file1.txt the remaining 70 words was used to fill up the fields in file2.txt. It might be necessary to increase the number of words substantially to get realistic results ( according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 ( 1 million ) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.

gen_input_files.pl:

#! /usr/bin/env perl use feature qw(say); use strict; use warnings;  use Data::Printer; use Getopt::Long;  GetOptions ("num_lines=i" => \my $nlines )   or die("Error in command line arguments\n");  # Generated random words from site: https://www.randomlists.com/random-words my $word_filename        = 'words.txt'; # 75 random words my $num_match_words      = 5; my $num_file2_lines      = $nlines || 1_000_000; my $file2_words_per_line = 3; my $file2_match_field_no = 2; my $file1_filename       = 'file1.txt'; my $file2_filename       = 'file2.txt'; my $file1_regex_fn       = 'regexp1.txt';  say "generating $num_file2_lines lines.."; my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );  write_file1( $file1_filename, $words2 ); write_file2(     $file2_filename, $words1, $words2, $num_file2_lines,     $file2_words_per_line, $file2_match_field_no ); write_BOC_regexp_file( $file1_regex_fn, $words2 );   sub write_BOC_regexp_file {     my ( $fn, $words ) = @_;      open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";     print $fh '\\|' . (join "|", @$words) . 
'\\|';     close $fh; }  sub write_file2 {     my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;      my $nwords1 = scalar @$words1;     my $nwords2 = scalar @$words2;     my @lines;     for (1..$nlines) {         my @words_line;         my $key;         for (1..$words_per_line) {             my $word;             if ( $_ != $field_no ) {                 my $index = int (rand $nwords1);                 $word = @{ $words1 }[$index];             }             else {                 my $index = int (rand($nwords1 + $nwords2) );                 if ( $index < $nwords2 ) {                     $word = @{ $words2 }[$index];                 }                 else {                     $word =  @{ $words1 }[$index - $nwords2];                 }                 $key = $word;             }             push @words_line, $word;         }         push @lines, [$key, (join "|", @words_line)];     }     @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines;      open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";     print $fh (join "\n", @lines);     close $fh; }  sub write_file1 {     my ( $fn, $words ) = @_;      open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";     print $fh (join "\n", sort @$words);     close $fh; }  sub get_words {     my ( $fn, $N ) = @_;      open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";     my @words = map {chomp $_; $_} <$fh>;     close $fh;      my @words1 = @words[$N..$#words];     my @words2 = @words[0..($N - 1)];     return ( \@words1, \@words2 ); }

Next, I created a sub folder solutions with all the test cases:

$ tree solutions/ solutions/ ├── BOC1 │   ├── out.txt │   └── run.sh ├── BOC2 │   ├── out.txt │   └── run.sh ├── codeforester │   ├── out.txt │   ├── run.pl │   └── run.sh [...]

Here the files out.txt is the output from the greps for each solution. The scripts run.sh runs the solution for the given test case.

Notes on the different solutions

BOC1 : First solution presented by @BOC
```
grep -E -f regexp1.txt file2.txt 
```

BOC2 : Second solution suggested by @BOC:

LC_ALL=C grep -E -f regexp1.txt file2.txt

codeforester : Accepted Perl solution by @codeforester ( see source )
codeforester_orig : Original solution presented by @codeforested:
```
fgrep -f file1.txt file2.txt 
```
dawg : Python solution using dictionary and split line proposed by @dawg ( see source )
gregory1 : solution using Gnu Parallel suggested by @gregory
```
parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt 
```
See note below regarding how to choose $block_size.
hakon1 : Perl solution provided by @HåkonHægland (see source). This solution requires compilation of the c-extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt changes. Note: The time used to compile the c-extension at the initial run is not included in the run times presented below.
ikegami : Solution using assembled regexp and using grep -P as given by @ikegami. Note: The assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:
```
regexp=$(< "regexp_ikegami.txt") grep -P "$regexp" file2.txt 
```

inian1 : First solution by @Inian using match()

awk 'FNR==NR{     hash[$1]; next } {    for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt

inian2 : Second solution by @Inian using index()

awk 'FNR==NR{     hash[$1]; next } {    for (i in hash) if (index($0,i)) {print; break} }' file1.txt FS='|' file2.txt

inian3 : Third solution by @Inian checking only $2 field:

awk 'FNR==NR{     hash[$1]; next } $2 in hash' file1.txt FS='|' file2.txt

inian4 : 4th soultion by @Inian ( basically the same as codeforester_orig with LC_ALL ) :
```
LC_ALL=C fgrep -f file1.txt file2.txt 
```

inian5 : 5th solution by @Inian (same as inian1 but with LC_ALL ):

LC_ALL=C awk 'FNR==NR{     hash[$1]; next } {    for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt

inian6 : Same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for suggestion.
jjoao : Compiled flex-generated C code as proposed by @JJoao (see source ). Note: Recompilation of the exectuable must be done each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.
oliv : Python script provided by @oliv ( see source )

Vasiliou : Using join as suggested by @GeorgeVasiliou:

join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt

Vasiliou2 : Same as Vasiliou but with LC_ALL=C.
zdim : Using Perl script provided by @zdim ( see source ). Note: This uses the regexp search version ( instead of split line solution ).
zdim2 : Same as zdim except that it uses the split function instead of regexp search for the field in file2.txt.

Notes

I experimented a little bit with Gnu parallel (see gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and and currently it seems that the optimal choice is to devide the file (file2.txt) into 4 equal sized chunks, and run a single job on each of the 4 processors. More testing might be needed here. So for the first test case where file2.txt is 20M, I set $block_size to 5M ( see gregory1 solution above), whereas for the more realistic case presented below where file2.txt is 268M, a $block_size of 67M was used.
The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching. Meaning that the words from file1.txt did not have to match exactly in field #2 of file2.txt. A match anywhere on the line was accepted. Since this behavior made it more difficult to compare them with the other methods, some modified methods were also introduced. The first two methods called BOC1B and BOC2B used a modified regexp1.txt file. The lines in the original regexp1.txt where on the form \|foo1|foo2|...|fooN\| which would match the words at any field boundary. The modified file, regexp1b.txt, anchored the match to field #2 exclusively using the form ^[^|]*\|foo1|foo2|...|fooN\| instead.

The rest of the modified methods, codeforester_origB, inian1B, inian4B, inian5B, and gregory1B, used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line, of the form:
```
^[^|]*\|word1\|
^[^|]*\|word2\|
^[^|]*\|word3\|
[...]
```
and in addition, fgrep -f was replaced by grep -E -f for these methods.
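To make the difference between loose and strict matching concrete, here is a small sketch. It mimics `fgrep -f` with a plain substring test and the file1b.txt patterns with the anchored form `^[^|]*\|word\|`. The helper names and sample lines are made up for illustration and are not part of the benchmark.

```python
import re


def loose_match(line, words):
    """fgrep-style: accept the line if any word occurs anywhere in it."""
    return any(w in line for w in words)


def compile_strict(words):
    """One anchored pattern per word, in the same form as the file1b.txt lines."""
    return [re.compile(r"^[^|]*\|" + re.escape(w) + r"\|") for w in words]


def strict_match(line, patterns):
    """Accept the line only if field #2 is exactly one of the words."""
    return any(p.search(line) for p in patterns)


words = ["foo1"]
patterns = compile_strict(words)
samples = [
    "date1|foo1|number1",   # word is field #2: loose and strict both match
    "foo1|bar1|number2",    # word is in field #1 only: loose match only
    "date3|foo10|number3",  # word is a prefix of field #2: loose match only
]
for line in samples:
    print(line, loose_match(line, words), strict_match(line, patterns))
```

The last two sample lines are the kind of false positives a loose method reports and a strict one rejects.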

Running the tests

Here is the script used for running all the tests. It uses the Bash time command to record the time spent for each script. Note that the time command returns three different times, called real, user, and sys. At first I used user + sys, but I realized that this was incorrect when using the GNU parallel command, so the time reported below is now the real part returned by time. See this question for more information about the different times returned by time.
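The distinction matters because user and sys are CPU times summed over all child processes, while real is elapsed wall-clock time. With several workers running at once, user + sys can exceed real. The sketch below measures all three for one command on a Unix-like system. It is not part of the benchmark scripts, and `time_command` is a made-up helper name.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (real, user, sys) in seconds, similar to the shell's time."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    after = os.times()
    # children_* accumulate CPU time of child processes that have been waited for
    # (on Windows these fields are always zero).
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system


real, user, system = time_command([sys.executable, "-c", "sum(range(10**6))"])
print(f"real {real:.2f}s  user {user:.2f}s  sys {system:.2f}s")
```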

The first test is run with file1.txt containing 5 lines and file2.txt containing 1000000 lines. Here are the first 52 lines of the run_all.pl script; the rest of the script is available here.

run_all.pl

#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Cwd;
use Getopt::Long;
use Data::Printer;
use FGB::Common;
use List::Util qw(max shuffle);
use Number::Bytes::Human qw(format_bytes);
use Sys::Info;

GetOptions (
    "verbose"       => \my $verbose,
    "check"         => \my $check,
    "single-case=s" => \my $case,
    "expected=i"    => \my $expected_no_lines,
) or die("Error in command line arguments\n");

my $test_dir    = 'solutions';
my $output_file = 'out.txt';
my $wc_expected = $expected_no_lines; # expected number of output lines

my $tests       = get_test_names( $test_dir, $case );

my $file2_size  = get_file2_size();
my $num_cpus    = Sys::Info->new()->device( CPU => () )->count;

chdir $test_dir;
my $cmd = 'run.sh';
my @times;
for my $case (@$tests) {
    my $savedir = getcwd();
    chdir $case;
    say "Running '$case'..";
    my $arg = get_cmd_args( $case, $file2_size, $num_cpus );
    my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`;
    my ($user, $sys, $real ) = get_run_times( $output );
    print_timings( $user, $sys, $real ) if $verbose;
    check_output_is_ok( $output_file, $wc_expected, $verbose, $check );
    print "\n" if $verbose;
    push @times, $real;
    #push @times, $user + $sys; # this is wrong when using Gnu parallel
    chdir $savedir;
}

say "Done.\n";

print_summary( $tests, \@times );

Results

Here is the output from running the tests:

$  run_all.pl --verbose
Running 'inian3'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711

Running 'inian2'..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711

Running 'Vasiliou'..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester_orig'..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711

Running 'codeforester'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711

[...]

Summary

[Results obtained by @Vasiliou are shown in the middle column.]

                               |Vasiliou
My Benchmark                   |Results  |   Details
-------------------------------|---------|----------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose]
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose]
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose]
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict]
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict]
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]]
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]]
ikegami            : 0.1s      |         | grep -P
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]]
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict]
jjoao              : 0.21s     |         | [compiled flex generated c code]
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict]
codeforester_origB : 0.28s     |         | grep -E -f [strict]
dawg               : 0.35s     |         | [python + split+dict]
inian3             : 0.44s     |1.1s     | [awk + split+dict]
zdim2              : 0.4s      |         | [perl + split+dict]
codeforester       : 0.45s     |         | [perl + split+dict]
oliv               : 0.5s      |         | [python + compiled regex + re.search()]
zdim               : 0.61s     |         | [perl + regexp+dict]
inian2             : 0.73s     |1.7s     | [awk + index($0,i)]
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]]
inian1             : 19.46s    |         | [awk + match($0,i) [loose]]
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]]
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]]

Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM-Debian Testing 64bit- kernel 4.9.0.1 - no cpu freq scaling.

A more realistic test case

I then created a more realistic case with file1.txt having 100 words and file2.txt having 10 million lines (268 MB file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt, then extracted 100 of these words into file1.txt, and then constructed file2.txt the same way as described above for the first test case. Note that the dictionary file was UTF-8 encoded, and I stripped away all non-ASCII characters from words.txt.
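For readers who want to reproduce a similar data set without the Perl generator, here is a rough Python sketch. It follows the description in this answer: a subset of the words goes into file1.txt, and fields 1 and 3 of file2.txt are filled only with the remaining words, so a file1.txt word can appear as a whole field only in field 2. The function name, the seed handling, and the uniform random choice are assumptions, not taken from gen_input_files.pl.

```python
import random


def make_test_data(words, n_file1_words, n_lines, seed=0):
    """Return (file1_words, file2_lines) built from a word list.

    Assumes n_file1_words < len(words) so that some filler words remain.
    """
    rng = random.Random(seed)      # fixed seed gives reproducible data
    pool = list(words)
    rng.shuffle(pool)
    file1_words = pool[:n_file1_words]
    filler = pool[n_file1_words:]  # words that never go into file1.txt
    lines = []
    for _ in range(n_lines):
        f1 = rng.choice(filler)    # field 1: filler words only
        f2 = rng.choice(pool)      # field 2: any word, may match file1.txt
        f3 = rng.choice(filler)    # field 3: filler words only
        lines.append(f"{f1}|{f2}|{f3}")
    return file1_words, lines


# Tiny demonstration with made-up words instead of a dictionary file.
demo_words = [f"word{i}" for i in range(20)]
file1_words, file2_lines = make_test_data(demo_words, 5, 10, seed=1)
print(file1_words)
print("\n".join(file2_lines[:3]))
```

Note that a filler word can still contain a file1.txt word as a substring, which is where the false matches of the loose methods come from.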

I then ran the tests without the three slowest methods from the previous case, i.e. inian1, inian2, and inian5 were left out. Here are the new results:

gregory1           : 0.86s     | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s     | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s     | LC_ALL grep -E -f [strict]
BOC2B              : 1.13s     | LC_ALL grep -E [strict]
BOC2               : 1.15s     | LC_ALL grep -E [loose]
BOC1               : 1.18s     | grep -E [loose]
ikegami            : 1.33s     | grep -P
Vasiliou           : 1.37s     | [join [requires sorted files]]
hakon1             : 1.44s     | [perl + c]
inian4             : 2.18s     | LC_ALL fgrep -f [loose]
codeforester_orig  : 2.2s      | fgrep -f [loose]
inian6             : 2.82s     | [LC_ALL awk + split+dict]
jjoao              : 3.09s     | [compiled flex generated c code]
dawg               : 3.54s     | [python + split+dict]
zdim2              : 4.21s     | [perl + split+dict]
codeforester       : 4.67s     | [perl + split+dict]
inian3             : 5.52s     | [awk + split+dict]
zdim               : 6.55s     | [perl + regexp+dict]
gregory1B          : 45.36s    | [parallel + grep -E -f [strict]]
oliv               : 60.35s    | [python + compiled regex + re.search()]
BOC1B              : 74.71s    | grep -E [strict]
codeforester_origB : 75.52s    | grep -E -f [strict]

Note

The grep based solutions were looking for a match on the whole line, so in this case they contained some false matches: the methods codeforester_orig, BOC1, BOC2, gregory1, inian4, and oliv extracted 1,087,609 lines out of 10,000,000 lines, whereas the other methods extracted the correct 997,993 lines from file2.txt.

Notes

I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)
The whole benchmark study is available here.

answered Oct 05 '22 21:10

Håkon Hægland

A Perl solution. [See Note below.]

Use a hash for the first file. As you read the big file line-by-line, extract the field by regex (captures the first pattern between ||) or split (gets the second word) and print if it exists. They likely differ in speed a bit (time them). The defined check isn't needed in the regex while for split use // (defined-or) that short-circuits.

use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile\n"  if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;

open    $fh, '<', $bigfile or die "Can't open $bigfile: $!";
while (<$fh>)
{
    exists $word{ (/\|([^|]+)/)[0] } && print;

    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;

Avoiding the if branch and using short-circuit evaluation is faster, but only slightly. On billions of lines these tweaks add up, but again not to very much. It may (or may not) be a tad faster to read the small file line by line, instead of in list context as above, but this should not be noticeable.

Update Writing to STDOUT saves two operations and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools so I changed to write to STDOUT. Next, the exists test is not needed and dropping it spares an operation. However, I consistently get a touch better runtimes with it, while it also conveys the purpose better. Altogether I am leaving it in. Thanks to ikegami for comments.

Note The commented out version is about 50% faster than the other, by my benchmark below. These are both given because they are different, one finding the first match and the other the second field. I am keeping it this way as a more generic choice, since the question is ambiguous on that.
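The difference between the two variants only shows on lines with an empty second field. The following Python sketch is a loose translation of the hash lookup idea, not the Perl program itself. It uses a set for the small file and two interchangeable field extractors; all function names and sample lines are made up for illustration.

```python
import re

FIRST_NONEMPTY = re.compile(r"\|([^|]+)")  # same pattern as in the Perl regex variant


def field_by_regex(line):
    """First non-empty run of non-'|' characters that follows a '|'."""
    m = FIRST_NONEMPTY.search(line)
    return m.group(1) if m else None


def field_by_split(line):
    """Exactly the second '|'-delimited field (may be an empty string)."""
    parts = line.split("|")
    return parts[1] if len(parts) > 1 else None


def filter_lines(lines, words, extract):
    """Keep the lines whose extracted field is one of the wanted words."""
    wanted = set(words)  # O(1) membership test, like the Perl hash
    return [line for line in lines if extract(line) in wanted]


lines = [
    "date1|foo1|number1",  # regular line, field 2 is foo1
    "date2||foo1",         # empty field 2, foo1 sits in field 3
    "date3|bar9|number3",  # no match
]
print(filter_lines(lines, ["foo1"], field_by_split))
print(filter_lines(lines, ["foo1"], field_by_regex))
```

On the second sample line the regex variant skips the empty field and picks up field 3, so it reports a match that the split variant does not.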

Some comparisons (benchmark) [Updated for writing to STDOUT, see "Update" above]

There is an extensive analysis in the answer by Håkon Hægland, timing one run of most solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, which is expected to be fast and is used in the question and in many answers.

I build test data in the following way. A handful of lines of roughly the length shown are made with random words, for both files, so that they match in the second field. Then I pad this "seed" for data samples with lines that won't match, to mimic the ratios between sizes and matches quoted by the OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. Then these samples are written repeatedly to build full data files like the OP's, List::Util::shuffle-ed each time.

All runs compared below produce 106_120 matches for the above file sizes (diff-ed for a check), so the matching frequency is close enough. They are benchmarked by calling complete programs using my $res = timethese(60 ...). The results of cmpthese($res) on v5.16 are

        Rate regex  cfor split fgrep
regex 1.05/s    --  -23%  -35%  -44%
cfor  1.36/s   30%    --  -16%  -28%
split 1.62/s   54%   19%    --  -14%
fgrep 1.89/s   80%   39%   17%    --
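To read the table: each percentage compares the row's rate with the column's rate. The sketch below recomputes the entries from the four rates quoted above as (row rate / column rate - 1) × 100, which reproduces the table. That this is exactly what Benchmark's cmpthese does internally is my understanding rather than something stated in this answer.

```python
# Rates (runs per second) copied from the table above.
rates = {"regex": 1.05, "cfor": 1.36, "split": 1.62, "fgrep": 1.89}


def percent_vs(row, col):
    """How much faster (positive) or slower (negative) row is than col, in percent."""
    return round((rates[row] / rates[col] - 1) * 100)


names = list(rates)
print("      " + "".join(f"{n:>7}" for n in names))
for row in names:
    cells = ["     --" if row == col else f"{percent_vs(row, col):>6}%" for col in names]
    print(f"{row:<6}" + "".join(cells))
```

For example, "split" versus "cfor" gives 1.62 / 1.36 - 1, about 19%, while "cfor" versus "split" gives 1.36 / 1.62 - 1, about -16%.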

The fact that the optimized C program fgrep comes out on top isn't surprising. The lag of "regex" behind "split" may be due to the overhead of starting the engine for little matches, many times. This may vary over Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester ("cfor") since it was claimed to be fastest, and its lag behind the very similar "split" (16% in the table above) is likely due to scattered small inefficiencies (see a comment below this answer).†

This isn't shatteringly different, though there are certainly variations across hardware and software and over data details. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster.

The OP's experience of very slow fgrep is surprising. Given their quoted run times, order of magnitude slower than the above, I'd guess that there's an old system to "blame."

Even though this is completely I/O based there are concurrency benefits from putting it on multiple cores and I'd expect a good speedup, up to a factor of a few.
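One way to put this on multiple cores is to cut the big file into byte ranges that end on line boundaries and hand each range to its own worker. The sketch below shows only the splitting and the per-range filtering, run one range after another for simplicity. The function names are made up, and the parallel dispatch itself (for example with a process pool) is left out.

```python
import os
import tempfile


def chunk_ranges(path, num_chunks):
    """Return (start, end) byte ranges covering the file, each ending on a line boundary."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    target = -(-size // num_chunks)  # ceiling division: approximate chunk size
    ranges, start = [], 0
    with open(path, "rb") as fh:
        while start < size:
            fh.seek(min(start + target, size))
            fh.readline()            # move forward to the end of the current line
            end = fh.tell()
            ranges.append((start, end))
            start = end
    return ranges


def matches_in_range(path, start, end, wanted):
    """Lines in [start, end) whose second '|'-delimited field is in wanted (a set of bytes)."""
    out = []
    with open(path, "rb") as fh:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline()
            parts = line.rstrip(b"\n").split(b"|")
            if len(parts) > 1 and parts[1] in wanted:
                out.append(line)
    return out


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(1000):
        tmp.write(f"date{i}|word{i % 50}|{i}\n".encode())
wanted = {b"word7", b"word42"}
ranges = chunk_ranges(tmp.name, 4)
# The ranges are independent, so each could go to a separate worker process.
hits = [line for s, e in ranges for line in matches_in_range(tmp.name, s, e, wanted)]
print(len(ranges), len(hits))
os.remove(tmp.name)
```

Because every range starts and ends on a line boundary, no line is split between workers and the concatenated results equal those of a single pass over the whole file.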

† Alas, the comment got deleted (?). In short: unneeded use of a scalar (costs), of an if branch, of defined, of printf instead of print (slow!). These matter for runtime on 2 billion lines.

answered Oct 05 '22 23:10

zdim
