Find common lines between two files and also their line number

Tags:

bash

shell

I want to find the common lines between two large files, one with 90 million lines and one with 100 thousand, along with their line numbers.

comm -12 file1 file2

gives me the common lines, but I also want to know each line's number in the individual files.

342

asked Dec 20 '13 13:12

user31641

1 Answer

This solution works for me on my small test files - I'm not sure how it will perform on a file with 90 million lines.

tab=` printf '\t' `
join -t"$tab" -j2 <( cat -n file1 ) <( cat -n file2 )

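This works because cat -n prepends a space-padded line number followed by a tab character to each line. With the tab as the field separator, join then matches on the second field, that is, everything after the first tab. Note that join, like comm, expects its inputs to be sorted on the field being compared, so the files must already be sorted.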

After the join is complete, you should see the common lines, each followed by two numbers. The first number is the line number from file1 and the second from file2.
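To make the output format concrete, here is a minimal sketch on two tiny, hypothetical sample files (the file contents and the workdir and matches names are made up for illustration). It numbers the first file into a temporary file and feeds the second to join on stdin instead of using process substitution, and it spells the join field as -1 2 -2 2, which should be equivalent to -j2.

```shell
# Hypothetical sample data; both files are already sorted, as join (and comm) require.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry date > "$workdir/file1"
printf '%s\n' banana date elderberry   > "$workdir/file2"

tab=$(printf '\t')

# Number file1 into a temporary file, then stream the numbered file2 in on
# stdin ("-"). This avoids process substitution, so it also runs in plain sh.
cat -n "$workdir/file1" > "$workdir/file1.numbered"
matches=$(cat -n "$workdir/file2" |
    join -t "$tab" -1 2 -2 2 "$workdir/file1.numbered" -)

# Each output line: common text, line number in file1, line number in file2.
printf '%s\n' "$matches"

rm -r "$workdir"
```

For this sample data the result should list banana (line 2 of file1, line 1 of file2) and date (line 4 of file1, line 2 of file2), with the numbers left-padded by cat -n.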

Caveat: This works only if the files don't already contain tab characters. If they do, you can use sed to convert the first tab (the one added by cat -n) to a 'safe' character that appears in neither file.

safe="|"
join -t"$safe" -j2 \
  <( cat -n file1 | sed -e "s:\t:$safe:" ) \
  <( cat -n file2 | sed -e "s:\t:$safe:" )
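One portability note: \t inside a sed pattern is understood by GNU sed, but not necessarily by other implementations. A possible workaround, sketched below, is to pass sed a literal tab held in a shell variable. The helper names number_lines and join_numbered are hypothetical, and the sketch assumes the delimiter is a single character that occurs in neither file and is not /, & or \.

```shell
# number_lines FILE DELIM  (hypothetical helper)
# Like `cat -n FILE`, but with the tab after the line number replaced by DELIM.
number_lines() {
    tab=$(printf '\t')
    # A literal tab in the pattern avoids relying on sed understanding "\t".
    cat -n "$1" | sed "s/$tab/$2/"
}

# join_numbered FILE1 FILE2 DELIM  (hypothetical helper)
# Prints each common line followed by its line number in FILE1 and in FILE2.
join_numbered() {
    tmp=$(mktemp)
    number_lines "$1" "$3" > "$tmp"
    # Field 2 is now the entire original line, including any tabs it contains.
    number_lines "$2" "$3" | join -t "$3" -1 2 -2 2 "$tmp" -
    rm -f "$tmp"
}
```

It would be called as join_numbered file1 file2 '|', and the output fields are then separated by the chosen delimiter rather than by tabs.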

Also, depending on how join is implemented, it may help to list the smaller file in the first process substitution and the larger one in the second. That way the smaller file might fit entirely in memory while the larger file is scanned for matching lines. I have not verified that join works this way (since it requires sorted input, it may simply walk both files in a single merge pass, in which case the order should matter little), but it might be worth a shot.
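If holding the smaller file in memory is the goal, that idea can be made explicit with awk instead of join. The following is only a sketch of that alternative technique, not what join itself does: it loads the smaller file into an awk array and then scans the larger one once. The function name common_lines and the output layout are assumptions made for illustration.

```shell
# common_lines SMALL BIG  (hypothetical helper)
# Prints: line number in SMALL, line number in BIG, and the line itself,
# separated by tabs, in the order the matches appear in BIG.
common_lines() {
    awk '
        # First file: remember the first line number of each distinct line.
        NR == FNR { if (!($0 in seen)) seen[$0] = FNR; next }
        # Second file: report every line that also occurred in the first file.
        $0 in seen { printf "%d\t%d\t%s\n", seen[$0], FNR, $0 }
    ' "$1" "$2"
}
```

Calling common_lines file2 file1, with the 100-thousand-line file first, keeps only that file in memory. Unlike join, this does not need sorted input and is not affected by tabs in the data. Two limitations to be aware of: the NR == FNR test misbehaves if the first file is empty, and a line repeated in the smaller file is reported only with its first line number.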

148

answered Oct 13 '22 01:10

carl.anderson
