Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find common elements in >2 files

Tags:

compare

awk

I have three files as shown below

file1.txt

"aba" 0 0 
"aba" 0 0 1
"abc" 0 1
"abd" 1 1 
"xxx" 0 0

file2.txt

"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1

file3.txt

"xyx" 0 0
"aba" 0 0 
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1

I want to find the similar elements in all the three files based on first two columns. To find similar elements in two files i have used something like

awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt 

But, how can we find similar elements in all the files, when the input files are more than 2? Can anyone help?

With the current awk solution, the output ignores the duplicate key columns and gives the output as

"xxx" 0 0

If we assume the output comes from file1.txt, the expected output is:

"aba" 0 0 
"aba" 0 0 1
"xxx" 0 0 

i.e it should get the rows with duplicate key columns as well.

like image 496
chas Avatar asked Jun 05 '13 09:06

chas


People also ask

How do I find the common in two files?

Use comm -12 file1 file2 to get common lines in both files. You may also needs your file to be sorted to comm to work as expected. Or using grep command you need to add -x option to match the whole line as a matching pattern. The F option is telling grep that match pattern as a string not a regex match.

Which command is used to identify common lines in two files?

Use comm command; it compare two sorted files line by line. With no options, produce three column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.

How do I display the contents of two files in UNIX?

You can also use the cat command to display the contents of one or more files on your screen. Combining the cat command with the pg command allows you to read the contents of a file one full screen at a time. You can also display the contents of files by using input and output redirection.


2 Answers

This python script will list out the common lines among all files :

import sys
i,l = 0,[]
for files in sys.argv[1:]:
  l.append(set())
  for line in open(files): l[i].add(" ".join(line.split()[0:2]))
  i+=1
commonFields =  reduce(lambda s1, s2: s1 & s2, l)
for files in sys.argv[1:]:
  print "Common lines in ",files
  for line in open(files):
    for fields in commonFields:
      if fields in line:
        print line,
        break

Usage : python script.py file1 file2 file3 ...

like image 29
Sidharth C. Nadhan Avatar answered Sep 25 '22 21:09

Sidharth C. Nadhan


Try following solution generalized for N files. It saves data of first file in a hash with value of 1, and for each hit from next files that value is incremented. At the end I compare if the value of each key it's the same as the number of files processed and print only those that match.

awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2] 
        } 
    }
' file{1..3}

It yields:

"xxx" 0
"aba" 0

EDIT to add a version that prints the whole line (see comments). I've added another array with same key where I save the line, and also use it in the printf function. I've left old code commented.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2] 
            printf "%s\n", line[ key ] 
        } 
    }
' file{1..3}

NEW EDIT (see comments) to add a version that handles multiple lines with same key. Basically I join all entries instead saving only one, changing line[$1,$2] = $0 with line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At the time of printing I do the reverse splitting with the separator (SUBSEP variable) and print each entry.

awk '
    FNR == NR { 
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END { 
        num_files = ARGC -1 
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= length( line_arr ); i++ ) { 
                printf "%s\n", line_arr[ i ]
            } 
        } 
    }
' file{1..3}

With new data edited in question, it yields:

"xxx" 0 0
"aba" 0 0 
"aba" 0 0 1
like image 160
Birei Avatar answered Sep 23 '22 21:09

Birei