Counting equal lines in two files

Tags: bash, awk

Say, I have two files and want to find out how many equal lines they have. For example, file1 is

1
3
2
4
5
0
10

and file2 contains

3
10
5
64
15

In this case the answer should be 3 (common lines are '3', '10' and '5').

This is, of course, quite simple to do in Python, for example, but I got curious about doing it from bash (with standard utilities or extras such as awk). This is what I came up with:

 cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l

It seems too complicated for the task, so I'm wondering whether there is a simpler or more elegant way to achieve the same result.

P.S. Outputting the common part as a percentage of the number of lines in each file would also be nice, though it is not necessary.

UPD: Files do not have duplicate lines

asked Aug 13 '14 at 10:08 by mikhail


1 Answer

To find the lines your two files have in common, use awk:

awk 'a[$0]++' file1 file2

This outputs 3, 10 and 5 (one per line).

Now just pipe this to wc to get the number of common lines:

awk 'a[$0]++' file1 file2 | wc -l

Will output 3.
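
The question's P.S. also asks for the common part as a percentage of each file. One possible sketch, not part of the original answer: it recreates the question's sample files in a scratch directory, and the variable names (tmp, common, n1, n2, pct1, pct2) are my own. It assumes both files are non-empty, end with a newline, and contain no duplicate lines, as stated in the question's update.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

# Number of common lines, exactly as in the answer.
# tr strips the padding some wc implementations add.
common=$(awk 'a[$0]++' "$tmp/file1" "$tmp/file2" | wc -l | tr -d ' ')

# Total number of lines in each file.
n1=$(wc -l < "$tmp/file1" | tr -d ' ')
n2=$(wc -l < "$tmp/file2" | tr -d ' ')

# The shell only does integer arithmetic, so let awk do the division.
pct1=$(awk -v c="$common" -v n="$n1" 'BEGIN { printf "%.1f", 100 * c / n }')
pct2=$(awk -v c="$common" -v n="$n2" 'BEGIN { printf "%.1f", 100 * c / n }')

echo "common: $common (${pct1}% of file1, ${pct2}% of file2)"
rm -r "$tmp"
```

For the sample files this should report 3 common lines, which is 3 of 7 lines in file1 and 3 of 5 lines in file2.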

Explanation:

Here, a works like a dictionary with a default value of 0. Writing a[$0]++ adds 1 to a[$0], but the expression evaluates to the previous value of a[$0] (post-increment; compare a++ with ++a). So it yields 0 (false) the first time a given line is encountered and 1 or more (true) every time after that.

An awk program of the form awk 'condition' file has no action block, so the default action applies: every line for which condition is true is printed.
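
To make the two points above concrete, here is the one-liner written out in long form. This is only an illustrative sketch: the sample files are the ones from the question, recreated in a scratch directory, and the variable names tmp and common are my own.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

# Long form of: awk 'a[$0]++' file1 file2
common=$(awk '{
    if (a[$0] > 0)   # the line has been seen before (unset counts as 0)
        print        # so print it, as the default action would
    a[$0]++          # record this occurrence
}' "$tmp/file1" "$tmp/file2")

echo "$common"
rm -r "$tmp"
```

The explicit if plays the role of the condition, and the print is what awk does implicitly when no action block is given.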

Also be aware that the array a grows every time a new key is encountered. By the end of the run, its size equals the number of distinct lines across all your input files (in the OP's example, 9).

Note: this solution counts duplicates, i.e. if you have:
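
If the array size matters, a common awk idiom keeps only the lines of the first file in memory. The sketch below is a variant, not the answer's original command; the array name seen and the variables tmp and count are my own, and it assumes the first file is not empty (otherwise NR==FNR would also be true for the second file).

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

# NR==FNR holds only while awk is reading the first file:
#   first file  -> remember the line, then skip to the next record
#   second file -> print the line if it was remembered
count=$(awk 'NR==FNR { seen[$0]; next } $0 in seen' \
            "$tmp/file1" "$tmp/file2" | wc -l | tr -d ' ')

echo "$count"
rm -r "$tmp"
```

Here the array ends up with 7 entries (the lines of file1) instead of 9, because the test `$0 in seen` does not create new keys for lines that appear only in file2.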

file1 | file2
1     | 3
2     | 3
3     | 3

then awk 'a[$0]++' file1 file2 will output 3 3 3, and awk 'a[$0]++' file1 file2 | wc -l will output 3, even though the two files share only one distinct line.
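
If duplicates may occur and each common line should be counted only once, one alternative is to deduplicate each file with sort -u and intersect the results with comm. This is a sketch of a different technique from the answer's awk command; the intermediate file names s1 and s2 and the variables tmp and count are hypothetical.

```shell
# Recreate the duplicate example from the note above.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 > "$tmp/file1"
printf '%s\n' 3 3 3 > "$tmp/file2"

# sort -u sorts each file and drops repeated lines within it.
sort -u "$tmp/file1" > "$tmp/s1"
sort -u "$tmp/file2" > "$tmp/s2"

# comm -12 suppresses the lines unique to either input,
# leaving only the lines present in both sorted files.
count=$(comm -12 "$tmp/s1" "$tmp/s2" | wc -l | tr -d ' ')

echo "$count"
rm -r "$tmp"
```

For this input the count is 1 rather than 3, since the line 3 is the only one the files share. comm expects its inputs to be sorted in the current locale's order, which sort provides here.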

answered Oct 17 '22 at 01:10 by Aserre