find common elements in >2 files

Tags:

compare

awk

I have three files as shown below

file1.txt

"aba" 0 0 
"aba" 0 0 1
"abc" 0 1
"abd" 1 1 
"xxx" 0 0

file2.txt

"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1

file3.txt

"xyx" 0 0
"aba" 0 0 
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1

I want to find the elements common to all three files, based on the first two columns. To find the common elements in two files, I have used something like:

awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt

But how can we find the common elements across all the files when there are more than two input files? Can anyone help?

With the current awk solution, the output ignores rows with duplicate key columns and gives:

"xxx" 0 0

If we assume the output comes from file1.txt, the expected output is:

"aba" 0 0 
"aba" 0 0 1
"xxx" 0 0

i.e., it should return the rows with duplicate key columns as well.


asked Jun 05 '13 09:06

chas

2 Answers

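This Python script lists the lines whose first two columns appear in every file:

import sys
from functools import reduce

# Build one set of "col1 col2" keys per input file.
key_sets = []
for path in sys.argv[1:]:
    with open(path) as fh:
        key_sets.append({" ".join(line.split()[:2]) for line in fh})

# Keys present in every file.
common_fields = reduce(lambda s1, s2: s1 & s2, key_sets)

for path in sys.argv[1:]:
    print("Common lines in", path)
    with open(path) as fh:
        for line in fh:
            # Match on the key columns only, not on text elsewhere in the line.
            if " ".join(line.split()[:2]) in common_fields:
                print(line, end="")

Usage (Python 3): python3 script.py file1 file2 file3 ...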
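
If only the rows of the first file are wanted, as in the question's expected output, the same set intersection can be wrapped in a small function. This is a minimal sketch rather than part of the original script: the names `key_of` and `common_rows` are made up for illustration, and it takes lists of lines instead of file names so it can be tried without creating files.

```python
from functools import reduce


def key_of(line):
    """Return the first two whitespace-separated columns as a tuple."""
    return tuple(line.split()[:2])


def common_rows(first, *others):
    """Rows of `first` whose two-column key occurs in every other input."""
    key_sets = [{key_of(l) for l in lines} for lines in (first, *others)]
    common = reduce(lambda a, b: a & b, key_sets)
    # Iterating over `first` keeps its original row order and its duplicates.
    return [l for l in first if key_of(l) in common]


# Sample data copied from the question.
file1 = ['"aba" 0 0', '"aba" 0 0 1', '"abc" 0 1', '"abd" 1 1', '"xxx" 0 0']
file2 = ['"xyz" 0 0', '"aba" 0 0 0 0', '"aba" 0 0 0 1', '"xxx" 0 0', '"abc" 1 1']
file3 = ['"xyx" 0 0', '"aba" 0 0', '"aba" 0 1 0', '"xxx" 0 0 0 1', '"abc" 1 1']

for row in common_rows(file1, file2, file3):
    print(row)
```

On the sample data this prints the two "aba" rows and the "xxx" row of file1, in file order.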

answered Sep 25 '22 21:09

Sidharth C. Nadhan

Try the following solution, generalized for N files. It saves the keys of the first file in a hash with a value of 1, and increments that value for each hit in the following files. At the end, it compares the value of each key with the number of files processed and prints only the keys that match.

awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2] 
        } 
    }
' file{1..3}

It yields:

"xxx" 0
"aba" 0

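One thing to watch with this counting scheme: the counter goes up once per matching line, not once per file, so a key that repeats inside a later file overshoots `ARGC - 1` and the `!=` test drops it. The Python sketch below is only an illustration of that effect on the data as edited in the question; `count_per_line` and `count_per_file` are made-up names, and the inputs are lists of lines rather than files.

```python
from collections import Counter


def key_of(line):
    """Return the first two whitespace-separated columns as a tuple."""
    return tuple(line.split()[:2])


def count_per_line(files):
    """Mimic the awk logic: 1 for each key of the first file,
    plus 1 for every matching line in the later files."""
    counts = {key_of(l): 1 for l in files[0]}
    for lines in files[1:]:
        for l in lines:
            if key_of(l) in counts:
                counts[key_of(l)] += 1
    return counts


def count_per_file(files):
    """Count each key at most once per file."""
    counts = Counter()
    for lines in files:
        counts.update({key_of(l) for l in lines})
    return counts


files = [
    ['"aba" 0 0', '"aba" 0 0 1', '"abc" 0 1', '"abd" 1 1', '"xxx" 0 0'],
    ['"xyz" 0 0', '"aba" 0 0 0 0', '"aba" 0 0 0 1', '"xxx" 0 0', '"abc" 1 1'],
    ['"xyx" 0 0', '"aba" 0 0', '"aba" 0 1 0', '"xxx" 0 0 0 1', '"abc" 1 1'],
]
n = len(files)
per_line = count_per_line(files)
per_file = count_per_file(files)

print(sorted(k for k, c in per_line.items() if c == n))  # only the "xxx" key
print(sorted(k for k, c in per_file.items() if c == n))  # "aba" and "xxx" keys
```

Counting per line gives the `"aba" 0` key a total of 5 with three files, so an exact comparison against the file count rejects it; counting once per file keeps it.
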
EDIT: added a version that prints the whole line (see comments). I've added another array with the same key, where I save the line, and I use it in the printf call. I've left the old code commented out.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2] 
            printf "%s\n", line[ key ] 
        } 
    }
' file{1..3}

NEW EDIT (see comments): added a version that handles multiple lines with the same key. Basically, I join all entries instead of saving only one, changing line[$1,$2] = $0 to line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At print time I do the reverse, splitting on the separator (the SUBSEP variable), and print each entry.

awk '
    FNR == NR { 
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END { 
        num_files = ARGC -1 
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            n = split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= n; i++ ) { 
                printf "%s\n", line_arr[ i ]
            } 
        } 
    }
' file{1..3}

With the new data edited into the question, it yields:

"xxx" 0 0
"aba" 0 0 
"aba" 0 0 1

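For comparison, the same idea can be sketched in Python, where a dict of lists plays the role of the `SUBSEP`-joined string and a per-file set plays the role of `found`. This is an illustration under those assumptions, not a line-by-line translation of the awk program; `common_groups` is a hypothetical name and the inputs are lists of lines rather than files.

```python
def key_of(line):
    """Return the first two whitespace-separated columns as a tuple."""
    return tuple(line.split()[:2])


def common_groups(files):
    """Map each key seen in every file to the first file's lines with that key."""
    lines_by_key = {}
    for l in files[0]:
        # Collect every line of the first file under its key.
        lines_by_key.setdefault(key_of(l), []).append(l)

    hits = {k: 1 for k in lines_by_key}
    for lines in files[1:]:
        found = set()  # reset for each file, so a key counts once per file
        for l in lines:
            k = key_of(l)
            if k in hits and k not in found:
                hits[k] += 1
                found.add(k)

    return {k: v for k, v in lines_by_key.items() if hits[k] == len(files)}


files = [
    ['"aba" 0 0', '"aba" 0 0 1', '"abc" 0 1', '"abd" 1 1', '"xxx" 0 0'],
    ['"xyz" 0 0', '"aba" 0 0 0 0', '"aba" 0 0 0 1', '"xxx" 0 0', '"abc" 1 1'],
    ['"xyx" 0 0', '"aba" 0 0', '"aba" 0 1 0', '"xxx" 0 0 0 1', '"abc" 1 1'],
]

for rows in common_groups(files).values():
    for row in rows:
        print(row)
```

Unlike awk's `for ( key in arr )`, whose iteration order is unspecified, a Python dict keeps insertion order, so the groups come out in the order their keys first appear in the first file.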

answered Sep 23 '22 21:09

Birei
