I have two large files (27k lines and 450k lines). They look sort of like:
File1:
1 2 A 5
3 2 B 7
6 3 C 8
...
File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
...
I want the lines from both files whose 3rd-column value appears in both files (note that the lines with A and F were excluded):
OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
What's the best way?
To form the union of two files as multisets of lines, just combine them into one file, with duplicates. You can join file1 and file2 with cat (short for “concatenate”). To find the union of two files as sets, first find the union as multisets, then remove duplicates.
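A minimal sketch of both operations (the output name union.txt is arbitrary):

# union as a multiset: concatenate, keeping duplicate lines
cat file1 file2 > union.txt

# union as a set: concatenate, then drop duplicate lines
cat file1 file2 | sort -u > union.txt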
Use comm -12 file1 file2 to get the lines common to both files. Note that comm needs its input files to be sorted to work as expected. Alternatively, with grep you need to add the -x option to match whole lines; the -F option tells grep to treat each pattern as a fixed string, not a regex.
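For example (a sketch; note both variants compare whole lines, not just the 3rd column, so they don't answer this question by themselves):

# comm wants sorted input; -1 and -2 suppress lines unique to either file
comm -12 <(sort file1) <(sort file2)

# grep equivalent: -F fixed strings, -x whole-line match, patterns read from file2
grep -Fxf file2 file1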
First we sort the files on the third field:
sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted
Then we get the common values of the 3rd field using comm:
comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field
Now we can join each sorted file on the common values:
join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined
The output is formatted so we get the same field order as the one used in the input files.
Standard Unix tools used: sort, comm, cut, uniq, join.
The <( ) process substitution works with bash; for other shells you might use temp files instead.
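For example, a sketch of the comm step above that works in a plain POSIX shell, using temp files (the names f1.values and f2.values are arbitrary):

cut -d " " -f 3 file1.sorted | uniq > f1.values
cut -d " " -f 3 file2.sorted | uniq > f2.values
comm -12 f1.values f2.values > common_values.field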
Here's an option using grep, sed and cut.
Extract column 3:
cut -d' ' -f3 file1 > f1c
cut -d' ' -f3 file2 > f2c
Find matching lines in file1:
grep -nFxf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1 > out
Find matching lines in file2:
grep -nFxf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out
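Step by step, here is the file1 pipeline again with each stage commented (same commands, just split across lines):

grep -nFxf f2c f1c |    # line numbers of f1c values that also occur in f2c ("N:value")
cut -d: -f1 |           # keep only the line number N
sed 's/$/p/' |          # turn each N into the sed command "Np"
sed -n -f - file1 > out # run those commands as a sed script against file1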
Output:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution would be pretty efficient:
parse.awk
FNR == NR {       # first file only: index it by the 3rd column
  a[$3] = $0      # remember the file1 line for this key
  p[$3] = 1       # flag the key as present in file1
  next
}
a[$3]             # print the current file2 line if its $3 was seen in file1
p[$3] {           # first time this key appears in file2 ...
  print a[$3]     # ... also print the stored file1 line
  delete p[$3]    # ... but only once
}
Run it like this:
awk -f parse.awk file1 file2
Where file1 is the smaller of the two.
Explanation
- The FNR == NR block reads file1 into the two hashes.
- a[$3] prints the file2 line if $3 is a key in a.
- p[$3] prints the stored file1 line if $3 is a key in p, and deletes the key (so it is only printed once).