I want to find the common lines between two large files, one with 90 million lines and one with 100 thousand, along with their line numbers.
comm -12 file1 file2
gives me the common lines, but I also want to know their line numbers in each of the individual files.
Use comm -12 file1 file2 to get the lines common to both files. Both files need to be sorted for comm to work as expected. Alternatively, with grep, add the -x option so that only whole lines count as matches. The -F option tells grep to treat each pattern as a fixed string rather than a regular expression.
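As a rough illustration of the grep options mentioned above, the sketch below also adds -n to print line numbers and -f to read the patterns from a file. The sample files and the in_file1 variable are made up for the example. Note that this only reports line numbers from the file being searched, so it would have to be run a second time with the files swapped to get the numbers from the other one.

```shell
# Hypothetical sample data, created only for this illustration.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/file1"
printf 'banana\ncherry\ndate\n'  > "$dir/file2"

# -F: treat patterns as fixed strings, not regular expressions
# -x: only whole-line matches count
# -f: read the patterns (one per line) from file2
# -n: prefix each matching line with its line number in file1
in_file1=$(grep -n -F -x -f "$dir/file2" "$dir/file1")
printf '%s\n' "$in_file1"
```

With the sample data this prints 2:banana and 3:cherry, which are the positions of the common lines in file1.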
Use the diff command to compare text files. It can compare single files or the contents of directories. When run on regular files, or on text files in different directories, diff reports which lines must be changed so that the files match.
The wc command is used to find the number of lines, characters, words, and bytes of a file. To find the number of lines using wc, we add the -l option. This will give us the total number of lines and the name of the file.
Which command is used for comparing two files? The diff command reports the changes needed to turn one file into the other so that they become identical, while comm displays the lines common to both files.
This solution works for me on my small test files - I'm not sure how it will perform on a file with 90 million lines.
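tab=$(printf '\t')
# -1 2 -2 2: join on field 2 of each input (the line text); field 1 is the number added by cat -n
join -t"$tab" -1 2 -2 2 <( cat -n file1 ) <( cat -n file2 )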
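This works because cat -n prepends a space-padded number followed by a tab character to each line. The join then finds the common lines looking only at the stuff after the first tab. Note that join expects both inputs to be sorted on the join field, which holds here if file1 and file2 are already sorted, as comm also requires.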
After the join is complete, you should see the common lines, each followed by two numbers. The first number is the line number from file1 and the second from file2.
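To make the output format concrete, here is a small sketch of the same idea on made-up sample files. It is a variant under a few assumptions: the numbered copies are written to temporary files (n1, n2) instead of using process substitution so it also runs in plain sh, the join field is given as -1 2 -2 2, and both sample files are already sorted.

```shell
# Hypothetical sample data; both files are already sorted.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/file1"
printf 'banana\ncherry\ndate\n'  > "$dir/file2"

tab=$(printf '\t')

# Numbered copies: each line becomes "<padded number><TAB><original line>".
cat -n "$dir/file1" > "$dir/n1"
cat -n "$dir/file2" > "$dir/n2"

# Join on field 2 (the original line text), using the tab as the field separator.
result=$(join -t"$tab" -1 2 -2 2 "$dir/n1" "$dir/n2")
printf '%s\n' "$result"
```

With this data, banana is reported with line numbers 2 and 1, and cherry with 3 and 2. The numbers keep the space padding that cat -n added.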
Caveat: This works only if the files don't already contain tab characters. If they do, you can use sed to convert the first tab (the one cat -n added) to a 'safe' character that appears in neither file.
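# safe must be a character that appears in neither file; \t in the sed pattern is a GNU sed extension
safe="|"
join -t"$safe" -1 2 -2 2 \
<( cat -n file1 | sed -e "s:\t:$safe:" ) \
<( cat -n file2 | sed -e "s:\t:$safe:" )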
Also, depending on how join is implemented, you may want to have the smaller file listed in the first process substitution and the larger one in the second. This way the smaller file may all fit in memory and the larger file might be scanned and matching lines selected efficiently. I have no idea if this is the case, but it might be worth a shot.
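The idea of holding the smaller file in memory while streaming the larger one can also be sketched without join, using an awk associative array. This is a different technique from the join approach above, not a description of how join works internally. The file names small and big and the array name seen are made up for the example. It does not need sorted input, but if a line occurs more than once in the smaller file only its last line number is kept, and it assumes the smaller file is not empty.

```shell
# Hypothetical sample data: "small" plays the 100-thousand-line file, "big" the 90-million-line one.
dir=$(mktemp -d)
printf 'banana\ncherry\ndate\n'  > "$dir/small"
printf 'apple\nbanana\ncherry\n' > "$dir/big"

matches=$(awk '
    # First file (NR == FNR): remember the line number of every line.
    NR == FNR { seen[$0] = FNR; next }
    # Second file: print the line, its number in the small file, its number in the big file.
    $0 in seen { print $0 "\t" seen[$0] "\t" FNR }
' "$dir/small" "$dir/big")
printf '%s\n' "$matches"
```

Only the smaller file is stored in the array, so memory use grows with its size rather than with the 90 million lines of the larger one.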