Output whole line once for each unique value of a column (Bash)

Question

This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print a line for each distinct value of the peptides in column 2, meaning the above input would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I've tried so far, but clearly neither does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "	", $1}' file |sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing, It will need to treat peptides which are substrings of other peptides as distinct values (eg VSSILED and VSSILEDKILSR). Thanks :)

flolo · Accepted Answer

Just use sort:

sort -k 2,2 -u file

The -u removes duplicate entries (as you wanted), and the -k 2,2 makes just the field 2 the sorting field (and so ignores the rest when checking for duplicates).

Output whole line once for each unique value of a column (Bash)

Tags:

bash

shell

awk

uniq

Bede Constantinides

1 Answers

flolo

Recent Activity

Donate For Us

Output whole line once for each unique value of a column (Bash)

Tags:

bash

shell

awk

uniq

Bede Constantinides

1 Answers

flolo

Related questions

Recent Activity

Donate For Us