
Find common lines between two files and also their line number

Tags:

bash

shell

I want to find the lines common to two large files, one with 90 million lines and one with 100 thousand, along with the line number of each common line in both files.

comm -12 file1 file2

gives me the common lines, but I also want to know each line's number in the individual files.

user31641 asked Dec 20 '13 13:12


1 Answer

This solution works for me on my small test files - I'm not sure how it will perform on a file with 90 million lines.

tab=$(printf '\t')
join -t"$tab" -j2 <( cat -n file1 ) <( cat -n file2 )

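This works because cat -n prepends a space-padded line number followed by a tab character to each line. With the tab as the field separator, -j2 tells join to match on the second field, which is the original line text. Like comm, join expects both inputs to be sorted on the join field, which is already the case if the files are sorted well enough for comm -12 to work.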

After the join is complete, you should see the common lines, each followed by two numbers. The first number is the line number from file1 and the second from file2.
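
If the files are not already sorted, the same idea can still be applied by numbering the lines first and sorting afterwards, so the original line numbers survive the sort. The following is only a minimal sketch under a few assumptions: bash (for process substitution), no tab characters in the data, and a made-up function name, numbered_common, chosen for illustration. Sorting a 90-million-line file this way will take a while.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative, not from the answer above).
# Prints: line text <TAB> line number in $1 <TAB> line number in $2
# Assumes the files contain no tab characters.
numbered_common() {
  local tab
  tab=$(printf '\t')
  # Number the lines first, then sort on the text (field 2), so that each
  # line keeps its original number. LC_ALL=C makes sort and join agree on
  # the ordering.
  LC_ALL=C join -t"$tab" -j2 \
    <( cat -n "$1" | LC_ALL=C sort -t"$tab" -k2,2 ) \
    <( cat -n "$2" | LC_ALL=C sort -t"$tab" -k2,2 )
}

# Usage: numbered_common file1 file2
```

The numbers in the output keep the space padding that cat -n adds; piping the result through tr -d ' ' removes it, provided the lines themselves contain no spaces you need to keep.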

Caveat: This works only if the files do not already contain tab characters. If they do, you can use sed to convert the first tab on each line (the one cat -n inserts) to a 'safe' character that appears in neither file.

safe="|"
tab=$(printf '\t')   # a literal tab; \t inside a sed pattern is a GNU extension
join -t"$safe" -j2 \
  <( cat -n file1 | sed -e "s:$tab:$safe:" ) \
  <( cat -n file2 | sed -e "s:$tab:$safe:" )
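
Whether a character really is 'safe' can be checked before running the join. As a rough sketch, the helper below (delimiter_is_safe is a hypothetical name, not an existing tool) succeeds only when grep finds the candidate delimiter in none of the given files:

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit status 0 only if the delimiter occurs in none
# of the files. Usage: delimiter_is_safe DELIM FILE...
delimiter_is_safe() {
  local delim=$1 status=0
  shift
  # -F treats the delimiter as a fixed string, -q suppresses output.
  grep -qF -- "$delim" "$@" || status=$?
  # grep exits 1 for "no match", 0 for "match found", 2 for an error
  # such as an unreadable file; only 1 counts as safe.
  [ "$status" -eq 1 ]
}

# Usage: delimiter_is_safe '|' file1 file2 && echo "ok to use |"
```

On the 90-million-line file this costs one extra pass in the worst case, although grep -q stops at the first match.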

Also, depending on how join is implemented, you may want to have the smaller file listed in the first process substitution and the larger one in the second. This way the smaller file may all fit in memory and the larger file might be scanned and matching lines selected efficiently. I have no idea if this is the case, but it might be worth a shot.
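
The "keep the smaller file in memory and scan the larger one" idea can also be written out explicitly with awk. This is a different technique from join, not a description of how join works, and the function name common_with_numbers is invented for this sketch. It assumes the 100-thousand-line file fits comfortably in memory and is not empty, and it does not need either file to be sorted:

```shell
#!/usr/bin/env bash
# Hypothetical helper: load the small file into an awk array, then stream
# the large file once.
# Prints: line text <TAB> line number in small <TAB> line number in large
common_with_numbers() {
  local small=$1 large=$2
  awk -v OFS='\t' '
    # First file (small): remember the first line number of each distinct line.
    NR == FNR { if (!($0 in where)) where[$0] = FNR; next }
    # Second file (large): report every line that was seen in the small file.
    $0 in where { print $0, where[$0], FNR }
  ' "$small" "$large"
}

# Usage: common_with_numbers file2 file1   # smaller file first
```

Results come out in the order of the larger file. If a line is repeated in the smaller file, only its first line number is reported.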

carl.anderson answered Oct 13 '22 01:10