
extracting unique values between 2 sets/files

Working in linux/shell env, how can I accomplish the following:

text file 1 contains:

    1
    2
    3
    4
    5

text file 2 contains:

    6
    7
    1
    2
    3
    4

I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.

How do I do this from the command line?

many thanks!

mark, asked Jan 17 '11 19:01




2 Answers

    $ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
    6
    7

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is the entry in the hash (associative array) a whose key is the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !($0 in a) is true only when we have not seen the line text
  • Print the line of text if the above pattern is true; printing is the default awk behavior when no explicit action is given
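
To see the steps above end to end, here is a minimal sketch that recreates the two sample files from the question (one value per line) and runs the same awk program on them. The temporary directory and the `result` variable are only there for the illustration; they are not part of the original answer.

```shell
# Recreate the sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$tmpdir/file1"
printf '%s\n' 6 7 1 2 3 4 > "$tmpdir/file2"

# First pass (FNR==NR): remember every line of file1.
# Second pass: print the lines of file2 that were never remembered.
result=$(awk 'FNR==NR {a[$0]++; next} !($0 in a)' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Note that the output keeps the order in which the lines appear in file2, and neither file needs to be sorted.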
SiegeX, answered Sep 20 '22 00:09


Using some lesser-known utilities:

    sort file1 > file1.sorted
    sort file2 > file2.sorted
    comm -1 -3 file1.sorted file2.sorted

This will output duplicates, so if file1 contains one 3 but file2 contains two, this will still output one 3. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

    sort file1 | uniq > file1.sorted
    sort file2 | uniq > file2.sorted
    comm -1 -3 file1.sorted file2.sorted
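
The difference between the two variants is easiest to see on a small case. The sketch below uses made-up inputs (file1 holds a single 3; file2 holds 3 twice plus an 8) rather than the files from the question, and the variable names `with_dups` and `without_dups` are only for illustration.

```shell
# Scratch inputs, assumed for this example only.
tmpdir=$(mktemp -d)
printf '%s\n' 3 > "$tmpdir/file1"
printf '%s\n' 3 3 8 > "$tmpdir/file2"

# Variant 1: plain sort. comm pairs lines off one by one, so the
# second 3 in file2 has no partner in file1 and is still reported.
sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort "$tmpdir/file2" > "$tmpdir/file2.sorted"
with_dups=$(comm -1 -3 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

# Variant 2: sort piped through uniq. Each value appears at most once
# per file, so only values missing from file1 entirely are reported.
sort "$tmpdir/file1" | uniq > "$tmpdir/file1.sorted"
sort "$tmpdir/file2" | uniq > "$tmpdir/file2.sorted"
without_dups=$(comm -1 -3 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

printf 'with duplicates:\n%s\n' "$with_dups"
printf 'without duplicates:\n%s\n' "$without_dups"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

The first variant prints 3 and 8, the second prints only 8. Unlike the awk approach, comm reports its results in sorted order rather than in the order of file2.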

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.

Daniel Gallagher, answered Sep 22 '22 00:09